DOCUMENT RESUME

ED 211 592                                              TM 820 025

AUTHOR        McArthur, David
TITLE         Test Design Project: Studies in Test Bias. Annual
              Report.
INSTITUTION   California Univ., Los Angeles. Center for the Study
              of Evaluation.
SPONS AGENCY  National Inst. of Education (ED), Washington, D.C.
PUB DATE      1 Nov 81
GRANT         NIE-G-80-0112
NOTE          87p.; For related documents see TM 820 024 and TM 820
              026.

EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   *Bilingual Education; Bilingual Students; Elementary
              Education; *Ethnicity; Non English Speaking;
              *Research Methodology; *Statistical Analysis; *Test
              Bias; Test Construction; Writing Evaluation
IDENTIFIERS   Comprehensive Tests of Basic Skills






ABSTRACT

Item bias in a multiple-choice test can be detected by appropriate
analyses of the persons x items scoring matrix. This permits
comparison of groups of examinees tested with the same instrument.
The test may be biased if it is not measuring the same thing in
comparable groups, if groups are responding to different aspects of
the test items, or if cultural and linguistic issues take precedence.
An empirical study of the question of bias as shown by these
techniques was conducted. Five related schemes for the statistical
analysis of bias were applied to the Comprehensive Test of Basic
Skills, which was administered in either the English or
Spanish-language version at two levels of elementary school in
bilingual education programs. The objectives measured were recall or
recognition ability, ability to translate or convert verbal or
symbolic concepts, ability to comprehend concepts, ability to apply
techniques, and ability to extend interpretation beyond stated
information. The results indicated that several items in the tests
showed strong evidence of bias, corroborated by a separate analysis
of linguistic and cultural sources of bias for many items.
(Author/DHH)



***********************************************************************
*    Reproductions supplied by EDRS are the best that can be made     *
*                     from the original document.                     *
***********************************************************************



Center for the Study of Evaluation

UCLA Graduate School of Education
Los Angeles, California 90024




U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

[X] This document has been reproduced as received from the person or
    organization originating it.
[ ] Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not necessarily
represent official NIE position or policy.




DELIVERABLE - November 1, 1981

TEST DESIGN PROJECT: STUDIES IN TEST BIAS

Annual Report

David McArthur, Study Director

Grant Number
NIE-G-80-0112

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles



TABLE OF CONTENTS

Introduction to the Test Design Project Studies in Test Bias
    D. McArthur

Detection of item bias using analyses of response patterns (Appendix A)
    D. McArthur

Performance patterns of bilingual children tested in both languages
(Appendix B)
    D. McArthur

Bias in the writing of prose and its appraisal (Appendix C)
    D. McArthur

Potential sources of bias in dual language achievement tests
(Appendix D)
    B. Cabello

Cultural interference in reading comprehension: An alternative
explanation (Appendix E)
    B. Cabello



Introduction to the Test Design Project Studies in Test Bias 

The assessment of literacy in bilingual and limited English proficient
(LEP) students is a distinct problem area in applied psychometrics.
Non-native speaking students, some with generally impoverished language
skills, others simply weak in English, constitute a substantial
proportion of the student population in many regions of the United
States. The instruction and assessment of these students is an issue of
national concern. Even after placement into monolingual English classes,
cultural group characteristics continue to interact with the instruction
these students receive and to influence their performance on tests.
Lower levels of performance by bilingual and LEP students, whether from
a test given in English or given in their native language, may come
about either because schools cannot provide appropriate instruction or
because the instruments used to assess student competencies unfairly
underestimate their ability levels.

The Test Design Project studies in test bias were initiated by the
Center for the Study of Evaluation with the belief that "bias" in
assessment occurs in both the nature of the test and the situation
within which the test is given. These studies have used four primary
approaches to identify and interpret such bias. The first asked whether
the use of translated tests constitutes a viable strategy for assessing
non-English speaking pupils. A well-regarded test of children's academic
competencies, the Comprehensive Test of Basic Skills (CTBS), is widely
used both in its original English version and in a recent
Spanish-language translation; would the CTBS and the CTBS-Espanol prove
as successful as is claimed in being free from bias? This is a volatile
issue not only from the viewpoint of statistical analyses but also in
terms of the number of separate impacts such tests have on education.
The second approach was to determine whether the current schemes for
isolating test and item bias could be successfully applied to datasets
which might not necessarily meet the various theoretical and practical
strictures those techniques require. Moreover, a substantial line of
inquiry not intrinsically statistical in nature, i.e., content analysis
and linguistic analysis, was seen as central to the task of isolating
both the fact and to some extent the root causes of bias. The third
primary focus of effort in the analysis of test bias was to see whether
new methods, potentially more direct and more amenable to use in the
field, could be as successful in detection and interpretation of bias as
previous state-of-the-art analytic schemes which have been suggested.
One strong objection to many of those schemes has been that few of them
yield unambiguous information about bias, and most are substantially
more complex in their execution than might be desirable. A fourth
approach, related to the others, examined the likelihood that test bias
also occurs in alternative forms of testing, such as the free writing of
prose in response to a prompt. Here the potential for bias extends not
only to the questions of individual examinee performance but also to the
performance of the persons who rate the essays afterwards.

Following a year's planning and data acquisition, the second year of
the study was devoted to four separate analyses addressing the goals
noted above:

(1) Conventional analyses of bias including the approaches suggested
by classical test theory and newer multidimensional scaling methods,
coupled with pursuit of simplifications to the statistical analysis of
bias, including, first, the model of partitioned variances and, second,
the Student-Problem (S-P) methods in use in Japan (Sato and Kurata,
1977; Sato, 1980);

(2) Analysis of selected aspects of item content, the extent of
linguistic, cultural and social bias embedded within item stems and
answers, and the quality of translation of the CTBS from its original
English version to the Spanish-language CTBS-Espanol;

(3) Analysis of a selected dataset within the test bias project which
contained scores on both the English- and Spanish-language versions of
the CTBS from the same set of students, judged by their teachers to be
equally competent in both English and Spanish; and

(4) Analysis of ratings made by Hispanic and non-Hispanic raters using
an objective scoring system, who reviewed a special set of essays
generated by groups of Hispanic and non-Hispanic primary school
students.

Certain aspects of analysis suggested in the original proposal for this
project, and contained in some detail in interim reports, were initially
very appealing as possible routes for optimizing the detection and
interpretation of bias. They are the partitioning of variances,
log-linear analysis, and multidimensional scaling. In the long run,
however, when the necessary initial specifications are not known with
sufficient accuracy, each of these proved to be theoretically
problematic, relatively cumbersome and statistically unwise in its
application to the test bias question.

Partitioning of variance into a between-class and a within-class
component formed one major aspect of the original effort, and was
presented in some detail in the November 1980 deliverable. The primary
intent was to utilize such partitioning for each item to reveal patterns
of bias through the interpretation of both relative sizes of the
variances and their correlations with total test score and popular
distractor answers. However, substantive theoretical and practical
arguments against it became evident as work progressed. The first is
that the examination of within-class variations within items runs a
large risk of violating assumptions of homoscedasticity, and such
violations cannot be resolved independently. Secondly, between-class
variations are actually non-orthogonal to within-class variations except
in certain situations. A desirable index of bias would be one which
utilizes information from that portion of between-class variation which
excludes all other portions of the variance, but this index proved to be
intractable in practice. Additionally, calculations of effect size in
relation to variations in class size contain some unsolvable unknowns.

Log-linear analysis has been successfully applied to a number of studies
in sociology and was considered as a viable tool for analysis of bias
until it was determined that the stability of computations involved in
this technique was questionable with the sample sizes of the data sets
available. Additionally, the interpretation of results in the context
either of specific items or specific examinees (and thus a logical route
to the isolating of item bias) was hampered by requirements for
secondary analyses following the initial solution.

Inadequate sample size also profoundly affects the utility of
multidimensional scaling in these investigations. Likewise, the issue of
computational indeterminacy was a noticeable hindrance to effective
solutions using that method. When certain rigorous conditions for
specification of initial parameters are met, multidimensional scaling
may well prove as effective in detecting bias as other techniques.






The S-P method, first discussed in the November 1980 deliverable,
appears to hold a number of possibilities for effective and unambiguous
analysis of bias. It is a highly versatile contribution to the field of
testing from Japan, and contains minimal requirements on sample size,
prior scoring, item scaling and the like. The S-P model lends itself to
extensions into nondichotomous scoring and multiple pattern analysis, as
well as the possibility that the role of guessing in achievement test
scores can be analyzed effectively. In the main, the most recent efforts
of this project have been directed at isolating sources of patterning in
examinees' responses as a function of test items and distractors,
student abilities and backgrounds, and their interactions. Towards this
end, the S-P method, coupled with a small number of other techniques,
has proved singularly successful at the task of determining degree of
item bias, and the method of content analysis has proved to be an
important contribution to understanding the language used in the test.

The analyses conducted during the year have been prepared for
publication as follows:

McArthur, D.L. Detection of item bias using analyses of response
    patterns, Summer, 1981. Submitted for publication to the Journal of
    Educational Measurement (Appendix A).

McArthur, D.L. Performance patterns of bilingual children tested in
    both languages, Summer, 1981. Submitted for publication to the
    Journal of Educational Measurement (Appendix B).

McArthur, D.L. Bias in the writing of prose and its appraisal, Fall,
    1981. To be submitted (Appendix C).

Cabello, B. Potential sources of bias in dual language achievement
    tests, Fall, 1981. To be submitted to TESOL Quarterly (Appendix D).

Cabello, B. Cultural interference in reading comprehension: An
    alternative explanation, Fall, 1981. Accepted for the Annual
    Meeting of the California Educational Research Association, San
    Diego, November, 1981 (Appendix E).

Portions of these papers also have been accepted for the Annual Meeting
of the American Educational Research Association, New York, Spring,
1982.






DETECTION OF ITEM BIAS USING ANALYSES
OF RESPONSE PATTERNS

David L. McArthur
Center for the Study of Evaluation
University of California, Los Angeles

Supported by a grant from the National Institute of Education
(NIE-G-80-0112). Appreciation is extended to Beverly Cabello
for her analysis of cultural and linguistic issues.






Abstract

Item bias, when present in a multiple-choice test, can be detected
by appropriate analyses of the persons x items scoring matrix. Five
related schemes for the statistical analysis of bias were applied to a
widely used primary skills multiple-choice test which was administered
in either its English or Spanish-language version at each of two levels
to 1259 students in bilingual education programs. The results indicate
that from one-fifth to one-third of the items in the tests show strong
evidence of bias, corroborated by a separate analysis of linguistic and
cultural sources of bias for both the biased items and those items with
no statistical findings of bias.






A systematic but unanticipated pattern of responses to a multiple-
choice test found for an entire group of test-takers is generally
regarded as evidence of bias. This interpretation results from
indications of one or more differences between groups on levels of
knowledge and skill, or in linguistic and cultural issues related to the
use of language in the test. However, the behaviors of individual
respondents have important consequences for that interpretation. Whether
the respondent unerringly picks the correct response, or successfully
engages in elimination of incorrect answers, or guesses well, the
observer scores the item "correct" and concludes that the student
"knows" the required skills or material. The inference that the
respondent "does not know" is made whether he/she guesses incorrectly,
eliminates wrong choices badly, or chooses an attractive but incorrect
alternative.

Most likely, what look like systematic patterns of bias in test items
are the results of complex interactions of these group and individual
factors with one another and with certain properties of the test items.
What is required to make sense of the issue of bias is analysis of
patterns found in these combinations of performance. The multiplicity of
possible patterns suggests that the detection and interpretation of bias
must be conducted along several routes.

Goals of this research

The first of two purposes of this paper is to investigate analyses
of the persons x items scoring matrix of a test for the detection of
item bias. The persons x items scoring matrix contains a significant
amount of information about the patterns of responses generated by a set
of examinees. Using a few geometrical and statistical considerations,
the patterns of responses from separate groups of examinees tested with
the same instrument can be compared. If these patterns show that the
test is not measuring the same thing (skills, competence, thinking
abilities) in comparable groups, if the groups are responding to
different aspects of the test items, or if cultural and/or linguistic
issues take precedence, it may be that the test is biased.

The second purpose of this paper is to study empirically the question
of bias as shown by these several techniques in the context of a widely
used achievement test, the Comprehensive Test of Basic Skills (CTBS),
which has been translated from English into Spanish. The claims made
about this instrument include the statement that the Spanish-language
version represents a close replicate of the English-language version,
with careful attention having been exercised in removing all forms of
unintended bias. The primary task of this analysis is to ascertain the
degree of comparability of the two versions of the CTBS in the
assessment of similar groups of children, and to see if any bias
remains.



Related literature

A substantial research literature has developed around the term
"item bias" in the search for a single best all-purpose indicator which
always reveals bias whenever systematic discrepancies between groups are
found. A large number of methods have been proposed and a large number
of studies conducted (cf. reviews in Berk, in press; Subkoviak, Mack and
Ironson, 1981). Certain tests such as the Wechsler Intelligence Scale
for Children have been extensively investigated (cf. Sandoval, 1979).
The range of applications of the term "bias" is quite broad: studies
have examined sociocultural bias and the stereotyping of items and
answers, cultural differences, and linguistic variations (cf. Jensen,
1980); construct bias and the different aspects of performance tapped in
different examinee groups by the same test (cf. Ebel, 1975); and
contextual bias and the misuse of tests with specific groups (cf.
Williams, 1971). Occasionally the word is even used to mean a conscious
preference on the part of the examinee (Hudson, 1963).

Increasingly complex techniques have been set forth for the
detection of bias in items. Methods have been based on analysis of
variance, transformed item difficulties, factor techniques, adjusted chi
square procedures, distractor analyses, "adverse impact" and item
characteristic curves (Merz, 1980; Petersen, 1980; Rudner, Getson and
Knight, 1980). Many of these methods are statistically complex but, with
the exception of the last, statistically inelegant (Hunter, 1975);
unfortunately the most elegant solution, item characteristic curve
analysis, requires large numbers of items and respondents for its
computation. Few of these approaches offer convincing or useful
explanations of why some items are biased and others are not (Crowder,
1979). Faced with the multiplicity of both the forms of item bias and
the statistical methods that have been put forward to detect such bias,
one logical place to begin is to inquire about the nature of a test
which is absolutely free of bias.



An unbiased test

If a test could be created which fulfilled all of the requirements of
a bias-free instrument, its items would all measure the same trait or
ability and be equally reliable and equally valid for all groups
(Petersen, 1980). It would also show orderly variation in the relative
difficulties of the items, and be responded to in an orderly manner by
every individual. One example of the outcome of this improbable creature
is the familiar perfect Guttman scale, in which persons are perfectly
ordered by increments of skill level, and items within the test are
perfectly ordered by increments of difficulty. No higher-level item is
mastered by any respondent until each lower-level item is mastered;
guessing also plays no role. The sequence of successes and failures is
highly deterministic.
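The defining property of a perfect Guttman scale can be checked
mechanically. The following is a minimal sketch in Python, not part of
the original analysis; the function name `is_guttman_scale` and the 0/1
list-of-lists layout (rows are persons, columns are items) are
assumptions made for illustration.

```python
def is_guttman_scale(matrix):
    """Return True if a persons x items 0/1 matrix is a perfect Guttman scale.

    Items are ordered from easiest to hardest by number correct; every
    person must then pass an unbroken run of the easiest items and fail
    all remaining ones.
    """
    n_items = len(matrix[0])
    col_totals = [sum(row[j] for row in matrix) for j in range(n_items)]
    # Easiest item (most passes) first.
    order = sorted(range(n_items), key=lambda j: -col_totals[j])
    for row in matrix:
        ordered = [row[j] for j in order]
        score = sum(ordered)
        if ordered != [1] * score + [0] * (n_items - score):
            return False
    return True
```

Any single success on a harder item without success on every easier
item makes the function return False, which is the sense in which the
sequence of successes and failures is deterministic.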

Figure 1A represents a ten-item test with right/wrong scores for
ten respondents. These ten persons never successfully answered a more
difficult item without first having succeeded on a less difficult item.
An axis of performance can be drawn on the diagonal to separate all
correct scores from all incorrect scores. While the total p-value for
the test is lower for another group of ten persons tested on the same
ten items, shown in Figure 1B, the performance patterns are parallel.
Other than a main effect due to groups, nowhere in either diagram is any
indication of a systematic unexpected difference in the pattern of
responses or bias in the test.

A slightly biased test

A somewhat less artificial example of test results from a multiple-
choice test is shown in Figure 2A; the score matrix of a hypothetical
ten-item test has been sorted by both persons, on ascending total score,
and by items, on ascending level of difficulty. Neither persons nor
items is perfectly ordered in the sense used above, and guessing of
correct answers probably contributes by an unknown amount to the scores
obtained. Not one but two dividing lines are now required to separate
the patterns of performance in this figure. The first line, a cumulative
ogive representing student performance, is drawn on the matrix based on
the total correct score for every respondent. The second, representing
problem difficulty, is drawn as a cumulative ogive based on item
p-values. Note that for a test which demonstrates exclusively random
responding, the theoretical position of the student curve (S-curve)
would be vertical, and of the problem curve (P-curve), horizontal.

(Insert Figures 2A and 2B about here)
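The double sorting and the two dividing lines can be sketched in a few
lines of Python. This is an illustrative sketch only: the function name
`sp_chart` is hypothetical, and the orientation used here (highest
scoring persons in the top row, easiest items in the left column) is an
assumption chosen for convenience rather than the layout of the figures.

```python
def sp_chart(matrix):
    """Doubly order a persons x items 0/1 matrix and locate both curves.

    Returns (ordered, s_curve, p_curve) where
      ordered  : matrix with rows sorted by descending total score and
                 columns sorted by descending number correct,
      s_curve  : for each row, the column at which the student curve
                 falls (that person's total score),
      p_curve  : for each column, the row at which the problem curve
                 falls (the number of persons passing that item).
    """
    n_items = len(matrix[0])
    row_order = sorted(range(len(matrix)), key=lambda i: -sum(matrix[i]))
    col_totals = [sum(row[j] for row in matrix) for j in range(n_items)]
    col_order = sorted(range(n_items), key=lambda j: -col_totals[j])
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    s_curve = [sum(row) for row in ordered]
    p_curve = [sum(row[j] for row in ordered) for j in range(n_items)]
    return ordered, s_curve, p_curve
```

For a perfect Guttman scale the two curves trace the same staircase;
the more they separate, the less orderly the response pattern.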

At this juncture we introduce a second set of data obtained from the
same hypothetical test. The "respondents" were slightly less capable on
most items but all other considerations were held equal. A score matrix
for the same set of items as shown in Figure 2A but the second group of
examinees is shown in Figure 2B. The relative order of items is somewhat
changed because of differing levels of difficulty; the second group
performs less well overall than the first group. Statistical differences
between the data in Figures 2A and 2B should reflect overall item and
group differences, but because of the idealized symmetry between the
two, there is little likelihood that a statistical indicator of bias
would prove significant. An initial analysis of these figures
recommended by Jensen (1980) is a two factor (group x items) nested
analysis of variance. The interpretation of a significant groups effect,
in the absence of other significant factors, is that the groups behave
symmetrically with respect to ordering of item difficulties but that one
group is consistently more capable across the trait being appraised by
this test. A significant difference on both the groups and items
factors, plus a significant interaction between groups and items,
together suggest that the test items and examinee abilities in the two
groups are heterogeneous.¹ However, these findings would be quite
insufficient to say that the test is biased (Hunter, 1975), and,
additionally, do not account for the contribution of guessing.

A second approach recommended by Jensen (1980) for understanding the
differences between the two figures uses the phi coefficient, which is
the correlation obtained between the group response to a given item and
the same group's response to any other item in the test. Phi is a
measure of joint contingency; Jensen explains its use for analysis of
bias:

    Only if the two items have the same difficulty ... can phi be
    equal to 1 .... To determine the intrinsic correlation (of the
    items) free of the influences in item difficulty, we must divide
    the obtained phi by the maximum value of phi that could possibly
    be obtained with the given marginal frequencies (p. 431).

The ratio of phi to maximum value of phi is summed over all possible
pairs of items for each group, and then the ratios are compared. The
null hypothesis for this comparison is that the difference between the
obtained sums is not different from randomness, and thus there is no
systematic discrepancy in group performance. In the artificial situation
shown by the Guttman scale for both groups in Figure 1, this test is
necessarily nonsignificant. For data which do not fit the mandates of a
perfect scale, the obtained value for the comparison of ratio sums
increases as the discrepancy in overall patterns of response by the two
separate groups widens.² While the amount of difference between groups
is given by the analyses of variance and phi, the nature of patterns of
response to items is not adequately explained.
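The ratio of phi to its maximum can be illustrated with a short Python
sketch. The function names are hypothetical, and two simplifications
are assumptions of this sketch rather than details from the source: the
maximum of phi is taken as the value reached when everyone who passes
the harder item also passes the easier one, and item pairs in which
either item has no variance are skipped because phi is undefined for
them.

```python
from itertools import combinations


def phi_ratio(x, y):
    """phi / phi_max for two 0/1 item-score vectors from the same group.

    Returns None when either item is passed by everyone or by no one.
    """
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    if px in (0.0, 1.0) or py in (0.0, 1.0):
        return None
    pxy = sum(a * b for a, b in zip(x, y)) / n
    # phi and phi_max share the same denominator, so it cancels.
    # phi_max arises when the joint pass rate equals min(px, py).
    return (pxy - px * py) / (min(px, py) - px * py)


def sum_phi_ratios(matrix):
    """Sum phi / phi_max over all item pairs of a persons x items matrix."""
    n_items = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_items)]
    ratios = (phi_ratio(a, b) for a, b in combinations(columns, 2))
    return sum(r for r in ratios if r is not None)
```

Computing `sum_phi_ratios` separately for each group gives the two sums
whose difference is the quantity of interest; for a perfect Guttman
scale every usable pair contributes a ratio of 1.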

Only a small number of statistically-based analyses specifically
designed to study patterns of responding to multiple-choice tests have
been proposed. Tatsuoka (1981) and Harnisch and Linn (1981) have been
working on a norm conformity index and other parameters which address
each individual's performance in the context of patterns obtained by all
members of the group. Sato (1980) defines an index of disparity between
actual and ideal response patterns which can be applied to individuals
or to items. To unravel the problem of patterns, we now turn to Sato's
system of analysis of the persons x items matrix.

The S-P method and analysis of the persons x items matrix

The key element in Sato's (1980) S-P method of analysis of test
performance is the doubly-ordered persons x items matrix, with student
curve (S-curve) and problem curve (P-curve) drawn in. In Japan, this
procedure is widely used in classrooms to obtain the characteristic
performance of the set of examinees, which may be compared visually to
several "standard" curve functions for diagnostic purposes.³

Sato has developed an index of discrepancy to evaluate the degree
to which the S and P curves do not conform either to one another or to
the Guttman scale. Except in the case of the perfectly ordered sets
shown in Figure 1, there is always some degree of discrepancy between
curves. The index is explained as follows:

\[ D^* = \frac{A(N, n, \bar{p})}{A_B(N, n, \bar{p})} \]

    where the numerator is the area between the S curve and the P
    curve in the given S-P chart for a group of N students who took
    n-problem test and got an average problem-passing rate \bar{p},
    and A_B(N, n, \bar{p}) is the area between the two curves as
    modeled by cumulative binomial distributions with parameters N,
    n, and \bar{p}, respectively (Sato, 1980, p. 15).

The denominator is a function which expresses a truly random pattern
of responses for a test with a given number of subjects, given number of
items, and given average passing rate, while the numerator reflects the
obtained pattern for that test. As the value of this ratio approaches
1.0, it portrays an increasingly random pattern of responses. For the
perfect Guttman scale as represented by Figure 1, the numerator will be
0 and thus D* will be 0.⁴
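The logic of the ratio can be sketched in Python under two loudly
stated assumptions. First, the area between the curves is counted here
as the number of chart cells lying on opposite sides of the S and P
curves. Second, the denominator is approximated by averaging that area
over simulated random response matrices with the same N, n, and average
passing rate; this simulation is a stand-in for, and not a reproduction
of, the cumulative binomial model in Sato (1980). All function names
are hypothetical.

```python
import random


def curves(matrix):
    """Return (S curve, P curve): row and column totals, sorted descending."""
    n_items = len(matrix[0])
    s_curve = sorted((sum(row) for row in matrix), reverse=True)
    p_curve = sorted((sum(row[j] for row in matrix) for j in range(n_items)),
                     reverse=True)
    return s_curve, p_curve


def area_between(s_curve, p_curve):
    """Count chart cells that lie on opposite sides of the two curves."""
    return sum((j < s) != (i < c)
               for i, s in enumerate(s_curve)
               for j, c in enumerate(p_curve))


def discrepancy_index(matrix, n_sim=500, seed=0):
    """Approximate D*: observed area over the mean area under random responding."""
    n_persons, n_items = len(matrix), len(matrix[0])
    s_curve, p_curve = curves(matrix)
    observed = area_between(s_curve, p_curve)
    p_bar = sum(s_curve) / (n_persons * n_items)  # average passing rate
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sim):
        # Simulated test in which every response is correct with chance p_bar.
        sim = [[int(rng.random() < p_bar) for _ in range(n_items)]
               for _ in range(n_persons)]
        total += area_between(*curves(sim))
    expected = total / n_sim
    return observed / expected if expected > 0 else 0.0
```

Because both curves depend only on the row and column totals, a perfect
Guttman scale yields an observed area of zero and hence an index of
zero, in agreement with the text.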

Indices of discrepancy, when computed for each of two groups of
examinees, may not be statistically compared because of differences in
ranking of item difficulty, and/or compound differences in response
patterns to several items. However, as long as the two D* values
obtained are not equivalent, it is an indication that somewhere within
the matrices are one or more items which are behaving dissimilarly
across groups.

Analysis of respondents above P curve

Patterns of discrepant performance result from a mixture of random
behaviors and wrong choices, except for those items which are so easy
that no respondent gets them wrong. Aside from the tautology that
respondents with less ability are less likely to answer a given item
correctly, all other things being equal they are also likely to use
chance responding. Analysis of those respondents who are unlikely to be
answering randomly would seem a likely means to understanding patterns
and bias in items. To begin constructing a simple analytic solution to
this problem, suppose we take a single uncomplicated item from the S-P
chart, and examine the pattern of responses for only that portion of the
same group of examinees for whom the prediction of success is relatively
high, i.e., those above the P-curve. These are the examinees who tended
to score better overall. Specifically, respondents at the very top of
this select subgroup are expected to have had a finite but small
probability of having guessed their way to success. Respondents at the
bottom of this select subgroup would have a finitely larger probability,
while those at the very bottom of the entire S-P chart would be likely
to have a more random pattern.

If the selected item, however, is one for which no individual within
the sample, no matter how skilled, is able to answer knowledgeably, the
response pattern among the select group of putative "masters" should be
random, and should not differ from the response pattern of those
examinees not included in this subgroup. For a four-choice item of this
kind, the item's p-value should be about .25, and the select subgroup of
putative "masters" would be correct only 25% of the time. Figure 3
illustrates a pattern of responses for a nearly random item, in contrast
with an item which is fairly well fitted to the skills of a set of
respondents.

(Insert Figure 3 about here)
The proportions of "masters" who are indeed correct can be compared
between groups. With relatively uniform variances, the test of significant
difference in independent proportions applied to this problem yields a z
score; a significant z score would be an indication of possible bias
separate from the difference in average passing rates for that item, if
any. A comparison of nonuniform variances requires transforming the item
difficulties into standard score form, then testing the size of the
difference following Rudner, Getson and Knight (1980). Within certain
limits, an item which is relatively easy for one group and relatively
difficult for another may show no bias in the proportions of "masters"
who are correct, because those individuals who place above the P curve
all have the ability to answer that item correctly. However, on another
item one of two groups may not be academically equipped, or may be
prevented from responding by biases in the test, curriculum or culture;
thus the proportions may differ, possibly by an amount sufficiently large
to be deemed significant.
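
The comparison of "masters" proportions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the pooled standard error, and the counts are hypothetical and are not taken from the study.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z statistic for the difference between two independent proportions.

    Assumption: the standard error uses the pooled proportion, the common
    textbook form when variances are taken to be roughly uniform.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts: "masters" answering one item correctly in each group.
z = two_proportion_z(90, 120, 60, 110)
flagged = abs(z) > 1.96  # two-sided criterion at the .05 level
```

An item would be flagged for a closer look when the absolute z exceeds 1.96; the sign of z shows which group's "masters" had the higher success rate.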

Analysis of distractors

One further analysis of the potentially biased item is to examine
the patterns of wrong answers made by the separate groups of respondents.
Within the multiple choice test format, differences between groups in
the attractiveness of incorrect responses signal that the item's wrong
choices may be differentially distracting. When a given item has
attractive but incorrect responses for one group, Goodman and Kruskal's
lambda indicates whether another group shares the same proportional
pattern of selecting those incorrect responses (Veale and Forman, 1976).
Lambda is an index of predictive association, which shows "...how one
is led to predict differentially in light of the relationship..." (Hayes,
1963, p. 610, italics original). It is calculated for a problem
involving two groups by evaluating the largest discrepancy between rates
responding to similar wrong choices:

$$\lambda = \frac{\sum_{j} \max_{k} f_{jk} \; - \; \max_{k} f_{\cdot k}}{N - \max_{k} f_{\cdot k}}$$

where $\max f_{jk}$ is the larger frequency of the two groups for any single
wrong choice, and $\max f_{\cdot k}$ is the larger marginal frequency of the two
groups summed across all wrong choices.

If Goodman and Kruskal's lambda is appreciably above zero, the
interpretation can be made that the pattern of distraction is different for the two
groups. If the index is zero, even though the difficulty of the item and/or the
proportions who select a wrong option may differ between the two groups,
the pattern of selecting the wrong answers is about the same.
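
A minimal sketch of the lambda computation follows. It assumes that the first index runs over the two groups and the second over the wrong choices, which is the reading under which lambda is zero whenever both groups favor the same wrong choice; the function name and the counts are hypothetical.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the wrong choice from group membership.

    table[j][k] holds the frequency (or proportion) with which group j
    selected wrong choice k. Assumed layout: rows are groups, columns are
    wrong choices.
    """
    n = sum(sum(row) for row in table)
    # Largest marginal frequency of any wrong choice, summed across groups.
    largest_marginal = max(sum(col) for col in zip(*table))
    # Sum over groups of each group's most frequent wrong choice.
    sum_of_row_maxima = sum(max(row) for row in table)
    if n == largest_marginal:
        return 0.0  # only one wrong choice was ever used
    return (sum_of_row_maxima - largest_marginal) / (n - largest_marginal)

# Hypothetical 2 x 3 tables of wrong-answer counts (groups x wrong choices).
same_favorite = [[50, 30, 20], [20, 10, 5]]
different_favorite = [[50, 30, 20], [10, 40, 15]]

lam_same = goodman_kruskal_lambda(same_favorite)
lam_diff = goodman_kruskal_lambda(different_favorite)
```

In the first table both groups are drawn most often to the same wrong choice, so lambda is zero even though the rates differ; in the second the favorites differ and lambda is positive.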

Another check on the relative attractiveness of a wrong answer can
be made by counting the number of wrong answers which are chosen at least
10% more often than the next most popular wrong answers. These particular
wrong choices constitute a class of "popular distractors," each of which
can be studied further. The easiest comparison is between those items
for which both groups picked the same popular distractor and those items
for which both groups picked different popular distractors. Note that
in this latter case, the computation of lambda will always yield a nonzero
value.
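
The popular-distractor check can be illustrated as below. The sketch assumes that "10% more often" means a lead of ten percentage points of all responses to the item; the text does not settle whether the margin is absolute or relative, and the function name and counts are hypothetical.

```python
def popular_distractor(wrong_counts, n_responses, margin=0.10):
    """Return the index of the popular distractor for one item, or None.

    wrong_counts: frequency of each wrong choice within one group.
    n_responses:  number of respondents in that group answering the item.
    Assumption: the leading wrong choice must beat the runner-up by at
    least `margin` of all responses.
    """
    ranked = sorted(range(len(wrong_counts)),
                    key=lambda i: wrong_counts[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    if wrong_counts[top] - wrong_counts[runner_up] >= margin * n_responses:
        return top
    return None

# Hypothetical wrong-answer counts for one item, 200 respondents per group.
english = popular_distractor([40, 15, 10], 200)
spanish = popular_distractor([20, 55, 25], 200)
same_popular_distractor = english is not None and english == spanish
```

Items can then be sorted into those where both groups share a popular distractor and those where the popular distractors differ.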

A series of analyses of item bias has been described, with special
attention paid to those comparisons premised on the persons x items scoring
matrix, doubly sorted. The following sections describe the execution of
these analyses in the context of a multi-language achievement test.
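
As a reminder of what "doubly sorted" means for the persons x items matrix, the sketch below orders respondents by descending total score and items by ascending difficulty. The function name and the small score matrix are hypothetical.

```python
def double_sort(matrix):
    """Sort a persons x items 0/1 matrix on both margins.

    Rows (persons) are ordered by descending total score; columns (items)
    are ordered from easiest to hardest, i.e. by descending number correct.
    Ties keep their original order because Python's sort is stable.
    """
    n_items = len(matrix[0])
    item_totals = [sum(row[i] for row in matrix) for i in range(n_items)]
    item_order = sorted(range(n_items), key=lambda i: -item_totals[i])
    person_order = sorted(range(len(matrix)), key=lambda r: -sum(matrix[r]))
    sorted_matrix = [[matrix[r][i] for i in item_order] for r in person_order]
    return sorted_matrix, person_order, item_order

# Hypothetical scores: 3 persons x 3 items, 1 = right, 0 = wrong.
scores = [[0, 1, 0],
          [1, 1, 1],
          [0, 1, 1]]
sorted_scores, person_order, item_order = double_sort(scores)
```

For this small example the result is a perfect Guttman pattern, with every row a run of ones followed by zeros.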



Method

Instruments

For a study of the possible bias inherent in a multi-language test,
two levels of the Comprehensive Test of Basic Skills (CTBS) published by
CTB/McGraw Hill (1974, 1978), were administered in this study. Students
in grades 2 and 3 were given the CTBS Level C; participating fifth and
sixth grade students took Level 2. CTBS-English Level C is designed for
students in grades 1.6-2.9; CTBS-Spanish Level C is designed for students
in grade 2. CTBS-English Level 2 has a target population in grades 4.5 to
6.3; the Spanish translation was designed for students in grades 5 and 6.

The CTBS-English and CTBS-Spanish tests were selected for several
reasons. Test content is roughly parallel. The CTBS-Spanish was the
first test at CTB/McGraw Hill to be subjected to a four-step editorial
procedure designed to reduce test bias; included were studies of content
validity, application of editorial guidelines in item construction,
reviews for bias, and separate ethnic group pilot studies with the test.
In the translation of the CTBS from English to Spanish, the test developers
tried to keep the test content and measurement features intact. This, of
course, meant that in some cases word-for-word translations were not
possible. Nevertheless, it was the intent of the publisher to provide
tests that are similar in rationale and in the process/content
classification scheme. Thus, both the English- and Spanish-language versions
used in this study purport to measure the following objectives:

1. the ability to recognize or recall information

2. the ability to translate or convert concepts from one
kind of language (verbal or symbolic) to another

3. the ability to comprehend concepts and their interrelationships

4. the ability to apply techniques, including performing
operations

5. the ability to extend interpretation beyond stated
information (CTBS, 1974/1978)

Test length, test time and administration procedures are exactly
the same for English and Spanish versions of each test level.

Subjects

Five school districts in the state of California participated in the
study. The total number of pupils tested was 1259, representing 81 intact
classrooms.

Classrooms were selected to represent a wide range of program options.
The criterion for selection of school districts was that they had
bilingual-bicultural education programs funded either by Title VIII or by
the ESEA. Potential participants were identified from schools listed in
the California State Department of Education 1979 Bilingual Program
Directory. From this list, invitations were sent to schools which had at
least two classes at the same grade level (grades one, two, five, or six)
having bilingual programs. Additionally, instruction had to be delivered
in self-contained, multi-subject settings; departmentalized or pull-out
programs were excluded.

Analyses

Five statistics explained above were used to evaluate the data for
every item separately. Each uses a minimum threshold value, above which
the result is taken as an indication of possible bias in the item. The
analyses and their minimums can be summarized as follows:



a) Test of proportions of correct scores: across groups, a
difference between transformed p-values which generates a z>1.96;

b) Test of proportions of correct scores for "masters": across
groups, a difference between proportions of those respondents
above the P-curve who make errors, which generates a z>1.96;

c) Test of chance responding by "masters": within each group, a difference
between the obtained proportion of those passing the item and
a theoretical p-value of .25, which generates a z<1.96;

d) Test of differential attractiveness of wrong answers: a
Goodman and Kruskal lambda computed on the proportions of
incorrect answers by choice within item, such that λ>0.0;

e) Test of popular distractors: a wrong choice for an item attracting
at least 10% or more responses than the next most popular
wrong choice for that item.
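
Criterion (c) is the only one of the five in which a small z counts as a positive finding, so a short sketch may help. It assumes a one-sample z test against the chance value of .25, with the standard error taken from the observed proportion; the function names and counts are hypothetical.

```python
import math

def chance_z(n_correct, n_masters, p_chance=0.25):
    """z for the gap between the "masters" passing rate and chance.

    Assumption: the standard error is based on the observed proportion.
    """
    p = n_correct / n_masters
    se = math.sqrt(p * (1 - p) / n_masters)
    if se == 0:
        # All or none correct: the gap from chance is unbounded in z units.
        return math.copysign(math.inf, p - p_chance)
    return (p - p_chance) / se

def responds_at_chance(n_correct, n_masters, z_crit=1.96):
    """True when the passing rate is not reliably above chance (z < 1.96)."""
    return chance_z(n_correct, n_masters) < z_crit

# Hypothetical counts of "masters" passing two different items.
near_chance = responds_at_chance(30, 100)   # 30% correct
well_above = responds_at_chance(60, 100)    # 60% correct
```

An item is counted under this analysis when its "masters" fail to clear the chance level, which is the reverse of the other z-based criteria.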

Results

The number of items within each subtest by level, and the number of
students in each of two language groups who were included are shown at
the top of Table 1. Item p-values indicate that items ranged from moderately

Insert Table 1 about here
easy to very difficult for both language groups, with an overall mean of
somewhat over half of the items correct. While in a few items the
Spanish-language group did better, without exception the Spanish-language groups
scored lower overall on the subtests. In every instance the maximum
p-values achieved by the English-language groups are slightly higher than
the comparable scores for the Spanish-language groups. Table 1 also shows,
for the corresponding number of students, the p-value needed for a
significant (p<.05) difference from chance responding to an item. This
figure is obtained by reversing the usual computation for the test of
independent proportions, using z = 1.96 and p(chance) = .25. For all but one
of the subtests, both language groups had one or more items which appear
to represent random choice of the correct answer. Except for the Passage
Comprehension subtest at Level C, the Spanish-language group appears to
make random selections more often than the English-language group, an
assumption which is explored further below.
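
The "reversed" computation of the minimum required p can be sketched as follows. The exact standard-error formula is not given in the text; the sketch assumes it is based on the observed proportion, so that the threshold solves (p - .25) / sqrt(p(1 - p)/n) = 1.96. Under that assumption the function returns values close to those printed in Table 1 (about .2969 for n = 364 and .3033 for n = 286). The function name is hypothetical.

```python
import math

def minimum_required_p(n, z=1.96, p_chance=0.25):
    """Smallest p-value significantly above chance for n respondents.

    Solves (p - p_chance)^2 = z^2 * p * (1 - p) / n for the root above
    p_chance. Assumption: the standard error uses the observed p.
    """
    a = n + z * z
    b = -(2 * p_chance * n + z * z)
    c = n * p_chance ** 2
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

# Group sizes as reported in Table 1.
thresholds = {n: round(minimum_required_p(n), 4) for n in (364, 286, 231, 203)}
```

Smaller groups need a higher observed p-value before an item can be said to be answered at better than chance, which is why the Spanish-language thresholds in Table 1 are slightly larger.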

For purposes of illustration, two analyses recommended by Jensen
(1980) were conducted on the subtest with the smallest number of items,
Level C Passage Comprehension. The two-factor nested analysis of
variance for this subtest shows a significant effect due to the groups
factor (F (1,650) = 54.91, MSerror = 1.37), and a significant effect due
to the interaction between items and groups (F (17,11050) = 2.61, MSerror =
0.43). The ratio of phi to phi-max is higher for the English-language
sample than for the Spanish-language sample (English mean φ/φ-max = .8207;
Spanish mean φ/φ-max = .7666, t (151) = 4.01, p<.01). This brief set of
findings indicates only that the language groups are not performing the same
way as one another on the subtest. It seems that the Spanish-language
sample may have had more difficulty with some items than did their
English-language counterparts. No further detail can be learned from these analyses,
and they are not used in the study of the remaining subtests.

The S-P charts were drafted for each subtest by language group for
a total of eight complete charts. The index of discrepancy D* is presented
in the last row of Table 1. The fact that the D* values are higher for the
Spanish-language groups suggests that they engaged in patterns closer to
chance responding more often than did English-language groups. While the
differences between pairs of D* values are large for the Passage Comprehension
subtest at both Level C and Level 2, these values cannot be compared further.
The specific reasons why the Spanish-language versions generate larger
D* values can only be made evident with further analyses.

Results from the set of five analyses which together provide
sufficient evidence of patterns of discrepant performance are presented
below and in Table 2. The table shows percentages of items for each of
the four subtests in this study which exceed a critical minimum on each
of the five analyses.

Test of proportions of correct scores. The first of the concise
set of analyses is the test of proportions, which is applicable to
percentages of correct answers expressed in standard score form, for both
groups on each item of each subtest. The first two rows of Table 2 show
the percent of items favoring the English- or Spanish-language groups.
Six out of every ten items in the Vocabulary subtests show significant

Insert Table 2 about here

differences between groups; in a majority of instances the higher group
is the English-language group. Half of the items in the Passage
Comprehension subtest at Level C show a significant difference and over
three-quarters of the items in that subtest at Level 2 show a significant
difference; in no instance are the Spanish-language groups ahead of their
English-language counterparts.

Test of proportions of correct scores for "masters." Both the
second and third analyses in this set are based on the selective sample
of "masters," those students whose overall scoring position places them
above the P-curve for each item. By evaluating the proportions of correct
scores for those members of the language groups, a list of statistically
significant discrepancies between "masters" is generated. The third and
fourth rows of Table 2 show the percent of items within subtest for which
the success rate among "masters" is significantly higher for the
English-language or Spanish-language groups. The Passage Comprehension subtests
at both levels appear to have different rates at which the "masters" are
able to avoid the wrong answer; in the majority of instances the rate is
higher for the English-language groups. In the Passage Comprehension
subtests, the rate is uniformly higher for the English-language groups.

Test of chance responding by "masters." How often the samples of
"masters" are not able to choose the correct response at a rate better
than chance forms a third part of the analysis. The fifth and sixth rows
of Table 2 show that for the Level C subtests, no items are found for
which either group responded randomly. However, for Level 2, a small
number of items in both subtests elicited chance responding by "masters".
These items appear to be so difficult that not even the better students
could knowledgeably select the correct response. The Spanish-language
group has a much larger number of chance responses among "masters" than
the English-language groups on the Level 2 Passage Comprehension subtest.

Test of differential attractiveness of wrong answers. The fourth
analysis in this sequence is the analysis of differential patterns of
incorrect responses. Goodman and Kruskal's lambda was calculated for
each item, using a 2 x 3 table of groups by incorrect response rates.
Values ranged from 0.0 to .23, with a median of 0. Lambda will be 0 for
any 2 x 3 table of proportions for which both groups are attracted to
the same response, even if the actual dimensions of those attractions
differ drastically. As there is no exact test of significance, any
nonzero lambda was considered to be an indicator of possible bias. The seventh
row of Table 2 shows the percentage of items within each subtest for which
a nonzero lambda was found. The ratio of such items to the number of
items within subtest ranges from 1:4 to 1:2, suggesting that, when wrong
answers were selected, the two language groups often behaved very differently.

Test of popular distractors. The concluding analysis in this series
asks whether there are any incorrect choices which were sufficiently
attractive to be classed as popular distractors. In the final rows of
Table 2 are shown the percentage of items which meet the 10%-or-greater
criterion for the English-language groups, the Spanish-language groups,
and jointly across groups. Except in Passage Comprehension at Level 2,
the Spanish-language group's results show more items with popular distractors
than the English-language group. Percent joint overlap is of particular
interest, since that value gives another indication of the uniformity
of behaviors across language groups when selecting incorrect responses.
In the subtests in this study, the joint overlap of popular distractors is
very small, suggesting again that many items of the English version of the
test and the Spanish translation may not be as comparable as the test
designers intended.

The degree of overlap between the five analyses in terms of the number
of positive findings for each subtest is shown in Table 3. The

Insert Table 3 about here 
percentage of items for which none of the preceding analyses show evidence
of bias is remarkably small. Level C Passage Comprehension, for example,
has only a single item which never shows a difference between the language
groups. Over half of the items in that subtest have at least two positive
findings, and four of the items have three positive findings. Table 3
shows that the percentage of items for which three, four, or five out of
five statistical indicators yield positive results varies from about
one-fifth to about two-fifths of the items within each subtest.
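
The tally behind a table of this kind, and the selection of items with three or more positive findings, might look like the following sketch. The function name and the flag data are hypothetical; only the equal weighting of the five analyses and the cutoff of three come from the text.

```python
from collections import Counter

def summarize_indicators(flags_by_item, n_analyses=5, min_indicators=3):
    """Tally positive indicators per item and select items at the cutoff.

    flags_by_item maps an item number to the set of analyses (a-e) that
    gave a positive finding; every analysis carries equal weight.
    """
    counts = {item: len(flags) for item, flags in flags_by_item.items()}
    tally = Counter(counts.values())
    n_items = len(counts)
    percent_by_count = {k: 100.0 * tally[k] / n_items
                        for k in range(n_analyses + 1)}
    selected = sorted(item for item, c in counts.items()
                      if c >= min_indicators)
    return percent_by_count, selected

# Hypothetical flags for a four-item subtest.
flags = {1: set(), 2: {"a"}, 3: {"a", "b", "e"}, 4: {"a", "b", "c", "d", "e"}}
percent_by_count, selected = summarize_indicators(flags)
```

The selected items are the ones that would go forward to an analysis of item content.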



Content analysis

On the basis of the preceding evidence from the statistical approach
to bias detection in the CTBS, those items which show agreement of three or
more indicators were subjected to a careful analysis of item content. The
content analysis was a search for possible linguistic, curricular, and/or
cultural reasons which might explain differential performance between
language groups. This portion of the study was conducted by an educational
researcher fluent in both English and Spanish, making extensive reference
to the curricular materials used by the students in the sample, and
consulting with native speakers of various dialects in making an appraisal.
Five categories were tabulated as possible sources of influence which
item content might exert on the different language groups:

a) Mistranslation: the meaning and/or grammatical form of a key
word or phrase within the item was translated from the English
original in a manner which is an incorrect or inappropriate use
of the Spanish language;

b) Cultural bias: some key word or phrase within the item requires
familiarity with objects, behaviors, or values which are not
normally found in the Spanish and Latino cultures, or which may
have very different interpretations;

c) Linguistic bias: some key word or phrase within the item requires
familiarity with an idiomatic expression or verbal allusions
which, because of innate differences in language, do not
translate well;

d) Low frequency word bias: some key word or phrase within the item is
not found, or rarely found, in the basal readers used for
instruction by the students in our sample;

e) Unfamiliar context bias: some key word or phrase within the
item appears in a context which is quite different from that
found for the word or phrase in the basal readers used for
instruction.

An example of item content judged to bias respondents is shown by
item number 29 of the Level C Vocabulary subtest, an item for which all
statistical indicators point to possible trouble. Item 29 (rated as
category c, linguistic bias) requires the student to select a synonym
for "happy." The English-language version of the test yielded responses
which appear significantly disadvantaged on this particular item. While
the correct option for this item in the Spanish-language version, /alegre/,
was selected 60% of the time by our sample, the correct option in the
English-language version, /gay/, was selected only by 13% of the sample.
The English-language respondents instead split their selection equally
between two of the remaining options. Only one other item in the entire
test set received as strong a rejection, suggesting that among second and
third graders, the slang English-language meaning for 'gay' has not only
rendered it useless as a synonym for 'happy' but has given it a strong
pejorative flavor as well.

Table 4 shows data for items in each of the four subtests for which
Insert Table 4 about here
the content analysis identified probable sources of bias. The entries in
the table represent tabulations of the content analysis categories for those
items on each subtest which have three or more statistical indicators.

For the Level C Vocabulary subtest, twelve items have at least three
statistical indicators; nine of those twelve show evidence of linguistic
bias, and five of the nine show evidence from an additional category of
content bias as well. Three of the four items from the Level C Passage
Comprehension subtest fit at least one of the categories of content
bias, two of them with multiple indicators. Only four out of nineteen
items on the Level 2 Vocabulary subtest with three or more statistical
indicators do not have ostensible problems as shown by the content analysis
procedure. Of twenty-one items in the Level 2 Passage Comprehension
subtest with three or more indicators, only three cannot be corroborated
by the analysis of content. None of the items in any subtest which had
no statistical indicators of bias were found to have any content indicators
of bias.

Table 5 presents a summary of subtest performance by group when 
those items for which three or more statistical indicators turn up positive 

Insert Table 5 about here
are excluded. In three of the four subtests, the adjusted scores of the
Spanish-language groups move closer to their English-language counterparts.
A substantial difference remains, however, between scores for the Passage
Comprehension subtest at Level 2. The gain from initial to adjusted
group mean by the Spanish-language group is quite insufficient to raise
that value to the level of the English-language group. The adjusted
minimum p-values achieved by both groups move upward but the
English-language group pulls ahead noticeably.

Discussion

Five relatively simple analyses have been presented which point to
five related considerations in the search for bias. These are a) overall
group differences and their direction, b) differences in performance by a
select subsample of better respondents within groups, c) differences from
chance responding by those subsamples, d) differences between groups in
the selection of wrong answers, and e) degree of distraction provided by
wrong item choices. The first of these follows the well-known Angoff delta
procedure (Angoff, 1972) without resorting to the arbitrary use of
rescaling, which simply serves for added convenience. The second and
third analyses make use of the select subsample of putative "masters",
those students within each group whose overall performances place them
above the P-curve; these approaches are extensions of the work of Sato
(1980) and colleagues. The fourth and fifth procedures examine the bias
question by studying those parts of the multiple-choice item which are
usually excluded from study in a right-wrong scoring context (cf. Powell
and Isbister, 1974).

For purposes of this paper, the five procedures are considered
jointly, with equal weights. Interpretations of bias are confirmed in
the clear majority of cases where the joint indication of three or more
statistics is found for an item. Certain problems remain to be solved,
however, and therefore some conditions must be placed on the use of this
set of approaches to the detection of item bias. It is clear, for example,
that the first index, because it is based on proportion of correct items,
is to be used with caution: "proportions of correct answers in a group
of examinees is not really a measure of item difficulty. This proportion
describes not only the test item but also the group tested" (Lord, 1980,
p. 35). Indeed, throughout it must be remembered that the results of
this study are descriptive of this sample only, and no external criteria
are available to evaluate comparability across language groups by grade.

A second objection is that the psychometric properties of the CTBS
items are only partially expressed by reliance on p-values and the S-P
chart, which at its core relies on the index of item difficulty.
Thus, the conclusions drawn from work with that chart are only as good
as the strength of the item difficulty metric. In addition, the S-P
chart suffers from other metric problems. The first is that the
doubly-sorted persons x items matrix treats data, in part, as interval rather
than continuous data. Thus, for instance, subtle gradations of difficulty
may be given the same credence as larger differences in the case where
p-values are nonuniformly distributed. Analogously, nonlinear distributions
of total performance scores may contribute in unknown ways to the use made
of ranking information regarding respondents: the patterns may not be
as smooth as the chart makes them appear. Moreover, as the S-P chart
approaches randomness and its index of discrepancy, D*, approaches 1.0,
increasingly complex but hidden interactions between the properties of
the items in the test and the attributes of the sample are likely. Thus,
the second and third statistics in the analytic set depend upon certain
assumptions about the nature of performance patterns, violations of which
bear rather unclear consequences. Related problems appear in item
characteristic curve analysis (Linn, Levine, Hastings, and Wardrop, 1980),
and in the "adverse impact" approach (Merz, 1980).

A third objection to the procedures used in this study centers on
issues of guessing. In the absence of an externally valid explicit
criterion, correction for guessing does not seem feasible (Choppin, 1974).
Yet assumptions about the occurrence and distribution of guessing affect
all aspects of the analysis, particularly statistics which address incorrect
responses. Volitional bias, quite likely contributing to the anomalous
response by the English-language group to item 29 on the Level C
Vocabulary subtest, is nowhere adequately considered. How much of a role
guessing plays is not well treated by the assumption that chance responding
is represented by p = .25. In the very likely event that some members
of any group will engage in guessing some of the time on some items, only
the most general and simplistic conclusions can be drawn from the data
presented here. One problem of particular note is the strong possibility
that guessing assumes a gradient distribution within the persons x items
matrix. That is, from the most capable to the least capable person, the
contribution of guessing on any item may move from relatively low
probability to relatively high probability, thus potentially interfering with
diagnosis of problems inherent in the item. But such diagnosis lies at
the heart of the effort to decipher and describe item bias. Until the
gradient problem is separated from the bias problem, only partially
satisfactory conclusions can be drawn about either.

On the positive side, the high level of match between content analysis
and the aggregate of statistical evidence suggests that this simple approach
to bias detection may have as much viability as more laborious and unwieldy
procedures. The ease of computations and interpretations, and the
parsimony of explanation are also favorable points (Merz, 1980). While
some attempt is made in the preceding pages to demonstrate the use of
multiple indicators, more possibilities can be pursued within this
framework. The explanatory power of the five-part procedure appears to exceed
that offered by analysis of variance or phi/phi-max, and the assumptions
required about the configuration of persons and items are fewer in number
than those required by the modified chi-square analyses which recently
have been challenged as inadequate (Marascuilo and Slaughter, in press).

Comparison of the present set of results with those of more complex
analytic procedures conducted on the same data set awaits further study.
However, unlike the results reported by Linn, Levine, Hastings and Wardrop
(1981), in which item characteristic curve analyses for a hypothetical
dataset "...did not lend themselves to making generalizations about
features of items..." (p. 38), the findings of the present study suggest
at least one concluding observation. Many signals point to a primary
conclusion that a number of items in the English-language and
Spanish-language versions of the CTBS do not seem to be comparable. Across a
spectrum of indicators, the Spanish-language groups regularly produced
lower scores. In three of four subtests, removing those items for which
three or more statistical indicators pointed to difficulty gave adjusted
scores which were very similar between groups. In the fourth subtest,
that correction did not yield significant improvement, suggesting that
the Spanish-language sample at grade 6 may be disadvantaged in some
respect unrelated to the CTBS itself.






Footnotes 



1 The comparison of Figures 2A and 2B yields only a significant
difference on the factor of items (F(9, 162) = 13.98, p<.001).

2 For the difference between Figures 2A and 2B, χ² = 8.0222, p<.01.

3 Direct interpretation of item scores, person scores, and the amount
of discrepancy between the S and P curves is relatively easy to accomplish;
the same holds for item analysis, individual performance analysis, and other
summary statistics within a group. In Japan, this system has been
automated using a microcomputer (Sato, Takeya, Kurata, Morimoto and Chimura,
1981).

4 In Figure 2A, D* = .2534; in Figure 2B, D* = .3747. 






Table 1
Summary of performance by subtest by group

Subtest                             Level C                             Level 2
                                    Vocabulary        Passage           Vocabulary        Passage
                                                      Comprehension                       Comprehension
Group                               English  Spanish  English  Spanish  English  Spanish  English  Spanish

n items                             33                18                40                45
N students responding               364      286      363      280      378      231      377      203
mean p value                        .6570    .6212    .6254    .5924    .5599    .4302    .5225    .3832
s.d.                                .1619    .1775    .0874    .1139    .1473    .1506    .1254    .1022
maximum p                           .8571    .8542    .7356    .7128    .8568    .7662    .7507    .6321
minimum p                           .1395    .1538    .4826    .4088    .2892    .2078    .2366    .1272
minimum required p greater
  than chance responding            .2969    .3033    .2970    .3039    .2960    .3096    .2961    .3138
n items less than minimum
  required p                        1        2        0        0        11       2        9        11
index of discrepancy D*             .3408    .3568    .2353    .4690    .4416    .4980    .4741    .6288






TABLE 2

Percentage of Items Exceeding
Critical Minimums in Five Analyses

Subtest:  Level C Vocabulary; Level C Passage Comprehension;
          Level 2 Vocabulary; Level 2 Passage Comprehension

Analysis

a) Test of proportions of correct scores
     English significantly higher
     Spanish significantly higher

b) Test of proportions of correct scores for "masters"
     English significantly higher
     Spanish significantly higher

c) Test of chance responding by "masters"
     in English
     in Spanish

d) Test of differential attractiveness of wrong
   answers between groups

e) Test of popular distractors
     in English
     in Spanish
     Overlap between groups

9%
30%
6%






TABLE 3

Percent of Items Showing Statistical
Indicators of Differential Performance

Subtest:  Level C Vocabulary; Level C Passage Comprehension;
          Level 2 Vocabulary; Level 2 Passage Comprehension

No indicators
One indicator
Two indicators
Three indicators
Four indicators
Five indicators



9% 


s*% 


- 23%' 


33% 


. 39% \ 


18% 


in 


33% 


, 18% 


27,55- 


11% , 


33% 






6% 


0% ' 


8% 


3% 


0% \ 


/. 0% 



4% 
20% 
40% 
34% 
2% 
0% 



Table 4

Sources of content bias for items with three or more statistical
indicators of differential performance, by subtest

Key: a) test of proportions
     b) test of proportions of correct scores for "masters"
     c) test of chance responding by "masters"
     d) test of differential attractiveness of wrong answers
     e) test of popular distractors

     1) mistranslation
     2) cultural difference
     3) linguistic difference
     4) low frequency word or phrase
     5) unfamiliar context for word or phrase



Level C Vocabulary

item 2 at, e; 
♦6«a t), d 

7 I* d e 

12 a b e 

14 a b e 

15 a b c e 

16 a b e 
20 b d e 
23 a d e 
29 a b c d e 

• 30 a b d e 
32 a b e 






3' 
3 
3 



2 3 



Level C Passage Comprehension Level 2 Vocabulary 

d 

1 Z 
2 3 4 



Level 2 Passage Comprehension 



item la b 
4 a b 



4 

* 31 
3 

2'3 
2 4 
3 

2 3 
3 



6^ 
7a b 



d ; 

d e; 
d' ; 



-it- - 



item 1 a b 

6 a b,' e 

8 a b" d 

9 a b • e 

11 a b e 

12 a b . d e 

13 a b e 
15 a b e 

19 a b c e 

20 a b d e 
- 23 a d e 

25 a b c d e 

26 a b c e 
.•32 c d e 

34 a c d 

35 a b c e 
'36 bee 

39 a b c d 

40 a b • d 



2 

1 -3 
3 4 

1 2 
1 

1 2 3 
2 

r 3 
2 

1«2 



• 3 
3 
3 



item 1 

J. 2 

3. 
7 
9 

15 
17' 
18 

• 21- 
• -22 
24 
25 
28 
29' 
34 

'36 
37 

. 38 



a b 

a ,b 

a b 

a b 

a ' 
a 



e 
e 

\d e 
d 

c e 
d e 



a b e 
a b c . e 
a b . d 
a.b d 
abed* 
abed 
a b c e 
a \ c e 
a ' d e 
a b c 

bed 
a b c e 
39 a b c e 
41 - a b d' - 



45 



a 
i 



4'5 
A 

5 

4 5 
4 5 
4 5 



4 5. 



2 
2 



4 5' 
4 5" 
3 4 i 



c d 






Table 5

Revised summary of performance by subtest group, deleting
items with three or more statistical indicators

                            Level C                         Level 2
                                    Passage                         Passage
Subtest             Vocabulary      Comprehension   Vocabulary      Comprehension
Group             English Spanish  English Spanish  English Spanish  English Spanish

adjusted n items        21               14               21               24
adjusted mean      .6804   .6606    .6216   .6061    .5818   .5322    .5431   .4067
change from
  original         .0234   .0394   -.0038   .0137    .0219   .1020    .1230   .0969
adjusted s.d.      .1298   .1502    .0936   .1039    .1418   .1476    .0206   .0235
adjusted maximum   .8571   .8542    .7356   .7128    .8568   .7662    .7507   .5707
adjusted minimum   .4104   .3004    .4826   .4343    .3344   .3005    .2366  -.1272



Figure Captions

Figures 1A and 1B: 1A) Perfect Guttman scale for a hypothetical
ten-item test scored right (1) or wrong (0). Persons and items are
uniformly ordered, by total correct score and level of difficulty,
respectively. 1B) Perfect Guttman scale, showing uniform ordering
with lower overall performance.

Figures 2A and 2B: 2A) Hypothetical score matrix for a ten-item
test sorted by respondents on descending total score and by items on
ascending level of difficulty. S- and P-curves reflect cumulative ogives
of performance, and lead to an appraisal of the characteristic
performance of the group. 2B) Hypothetical score matrix for the same
test with a different group, again sorted by respondents and items.

Figure 3: Hypothetical patterns of response to two items by ten
persons, showing a poorly-fitted and a better-fitted item.
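
The sorting described in these captions can be sketched in a few lines. The code below is an illustrative assumption about how such a chart might be built, not the procedure or software used in the study, and all function names are hypothetical. It orders persons by descending total score and items by ascending difficulty, reads off the step positions of the S-curve (one per person) and the P-curve (one per item), and checks whether the sorted matrix forms a perfect Guttman scale.

```python
def sort_sp(matrix):
    """Sort a persons x items 0/1 matrix for an S-P chart.

    Rows (persons) are ordered by descending total score; columns (items)
    by descending number correct, i.e. ascending difficulty.
    """
    n_items = len(matrix[0])
    person_order = sorted(range(len(matrix)), key=lambda p: -sum(matrix[p]))
    item_totals = [sum(row[i] for row in matrix) for i in range(n_items)]
    item_order = sorted(range(n_items), key=lambda i: -item_totals[i])
    sorted_matrix = [[matrix[p][i] for i in item_order] for p in person_order]
    return sorted_matrix, person_order, item_order


def s_curve(sorted_matrix):
    """Per person: the column after which the S-curve step falls (total score)."""
    return [sum(row) for row in sorted_matrix]


def p_curve(sorted_matrix):
    """Per item: the row after which the P-curve step falls (number correct)."""
    n_items = len(sorted_matrix[0])
    return [sum(row[i] for row in sorted_matrix) for i in range(n_items)]


def is_perfect_guttman(sorted_matrix):
    """True when every sorted row is a run of 1s followed only by 0s."""
    return all(row == sorted(row, reverse=True) for row in sorted_matrix)


if __name__ == "__main__":
    # Ten persons, ten items; person t answers the t easiest items correctly.
    scale = [[1 if i < t else 0 for i in range(10)] for t in range(1, 11)]
    sorted_scale, _, _ = sort_sp(scale)
    print(is_perfect_guttman(sorted_scale))
    print(s_curve(sorted_scale))
    print(p_curve(sorted_scale))
```

In a perfect Guttman scale the two curves coincide; the more the observed responses depart from that pattern, the further the S- and P-curves separate.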



[Figure 1A: persons-by-items score matrix for a perfect Guttman scale,
with total score per person and percent correct per item (100 to 10);
mean p = .5500, s.d. = .3028]

[Figure 1B: persons-by-items score matrix for persons K through T,
with total score per person and percent correct per item (70 to 0);
mean p = .2800, s.d. = .2616]

[Figure 2A: sorted score matrix (persons by items) with item p-values,
total scores, and the S-curve and P-curve marked]

[Figure 2B: sorted score matrix for a different group (persons K through
T) with item p-values, total scores, and the S-curve and P-curve marked]

[Figure 3: response patterns of persons to a poorly-fitted item and a
better-fitted item, with the point where the P-curve crosses each item
marked]


References * ' 

Anghoff, W. H. A technique for tfie investigation of- cultural , differences. 
Paper presented at the Annual Meeting of the American Psychological 
Association, Honolulu, 1972. t 

Berk, R. A. (Ed.) Handbook of nfethods .for detecting test bias . Baltimore; 
.Johns Hopkins University Press, in press. . 

Choppin, B. H. The correction for guessing on^ob.iective tests . Stockholm, 
International Association for the Evaluation of Educational Achieve- 
ment, 1974. 

Crowder, C. R. An investigation of item bias occurring at dif ferent ability 
levels for Anglo and Mexican-American students . Paper presented at the 
Annual Meeting of the American Educational Research Association, San 
Francisco, 1979. «. ' 

Comp rehensive Test of Basic Skills (CTBS) Examiner's .Manual and Espanol 
— Examiner's Manual. Monterey, CTB/McGraw-Hi 11 , 1974/1978. 

* 

Ebel, R.'L Constructing unbiased achievement tests . Paper presented^ 
V the National Institute of Education Conference on test bias, Baltimore, 
1975. <■ . "X. 

Harnisch, D. L. & Linn, R. I. Analysis-of item response patterns-:- \pnsis- 
te ncv indices and their application to criterion-r eferenced tests. 
Paper presented at the Annual 'Meeting of»the American Educational, 
Research Association,' Los Angeles, 1981. ; r 

Hayes, W. L.' Statistics . ^NeW York, Holt, Reinhart and Winston, 1963. 

Hudson, L. . The relation of psychological test scores to academic bias. 
. BVitish Journal of Educational Psychology , 1963, 33, 120-131. 

Hunter, J. E. A critical analysis of'the use of itenT means and item test- 
correlations to determine the presence or absence of content bias; 
in achievement test items" ! Raper^presented at the National. Institute 
of Education-conference on "test bias, -Baltimore, 1975. 

Jensen, A. R. - Bias in mental testing . KSw York, Free Press, 1980. 

Linn, R. L., Levine, M. V.., Hastings, C. N. & Wardrop, J. L. Item bias in 
J? a test of reading comprehension. " Applied Psychological Measurement, 
& 1981, 5, 159-173. 

Lord, F M. Applications- of item response theory to pract ical testing 
problems. Hillsdale, New Jersey, Lawrence Erlbaum, 1980. - - 

Marascuilo, L. A. ^Slaughter, R. E. Statistical procedures for analyzing 
item biasiifsed on chi-square statistics. Journal of E ducational 
MeasurawCTt, in press, 1981. 



1 



38 



Mere, W. R. Methods of assessing bias and fairness in tests . ARC Tech- 
nical Report #121-79, Sacramento, Applied Research Consultants, 
1980. (ERIC Document Reproduction Service No. ED 198 145) ' 

Petersen, N. S. "«Bias in the selection rule, bias in the test. In van 
der. Kamp, L. J. T., Langerak, W. F.& de Gruiter, D. /I. M. (Eds.). 
Psychometrics for -educational debates . Chichester, G. B.., John 
Wiley and Sorts, 1980-. ( 

Powell, J.- C. & Isjbister, A. £.« A comparison. between 'right and wrong 
• answers on a multiple choice test. Educational and Psychological 
Measurement, 1974, 34, 499-509. 



Rudner, L. M. , Geston, P. R., & KnTght, D. L. Biased i ^detection 
techniques. Journal of Educational Statistics fc 198^5;, 213-233. 

Sandoval J. The WISC-R andjnternal evidence of test, bias with minority, 
groups. Journal of Consulting and Clinical Psychology , 1979, 47, 
919-927. 

Satb,.T. The S-P chart and the caution, index. NEC (Nippon Electric 
Company, Japan) , Educational Informatics Bulletin , 1980\ 
' » "* "~. * ■ ', > 

Sato/T.,%Takeya, M., Kurata, M., Morimoto,-Y. SThimura, H. An instruc- 
tional- flata" analysis machine with .a microprocessor -- SPEEDY. NEC *' 
(Nippon 'Electric Company, -Japan) Research and Development , 1981, 
* - .Nd«,.61,;55-£3i L • - * • * X 

Subkoviak, C J.-^ Mack; 'J*. L , S^Ironson,' G; H:>- Item bias detection 
■ procedures: empiri'cali^Hdatjpn . :Paper presented at the Annual 
Meeting of the American Ktacatfon/l ■.Research'" Association, Los Angeles, 
• " 1981. " ■ " *->' '.' ' % ' ' 

Tatsuoka, K. An approach to assessing the seriousness of* error types .,, 
. Paper presented at the Annual Meeting of the American Educational 
Research Association, Los Angeles 1 , 4 1§81. f* * 

Veale, J. R.-, & Foreman, D. I. . Cultural variation'in' criterion referenced 
tests: A. global item analysis . .Paper presented at the Ann^l Meeting 
of the American Educational Research Association, San Francisco, 1976. 

Williams, R. L. Abuses and misuses fn testing black children. Counseling 
Psychologist , 1971, 2, 62-77., ^ . • • 

* 



50 - 



Bias in the Writing of Prose and Its Appraisal
David L. McArthur
Center for the Study of Evaluation
University of California, Los Angeles



Supported by a grant from the National Institute of Education
(NIE-G-80-0012). Appreciation is extended to Edys Quellmalz
and Frank Capell for their roles in the planning of this study
and to Chi Ping Chou and Beverly Cabello for their roles in
the analysis.



Abstract

Evidence from a variety of sources suggests that systematic differences
can be found in the ratings given to student essays as a function not only
of the student's skills but also of aspects of both the student's
background and the background of the rater. Additionally, the nature of the
prompt which provided the central theme of the essay might bias the
outcome of the ratings of that essay. A study of ratings of fifth and sixth
graders who wrote paragraph-long essays in response to two topics presented
either in written or pictorial form is presented. Students were classified
as Hispanic-surnamed or non-Hispanic-surnamed; two teachers, trained as
raters using an objectively-based essay scoring scheme, represented an
Hispanic cultural background and two a non-Hispanic background. Results
from a blind rating of 100 complete essays show that several of the rating
subscales were significantly influenced by an interaction between student
ethnicity and rater ethnicity, and several subscales by rater ethnicity
alone. Student ethnicity alone was not a significant main effect on any
subscale. Prompt modality is significant for one subscale, and interacts
with rater ethnicity on one other. The findings are interpreted as a
direct indication of biased assessment.



The evaluation of prose writing by schoolchildren poses special
problems in relation to bias in educational appraisal. Many factors have
long been known to have major influence on the prose writing performance
of minority pupils. The literature on the issue of biases which occur in
the judgment of students' written work is much smaller, and has proved much
more contradictory. Are there specific aspects of non-native English
writing style which undermine the usual procedures for judging writing
performance? Do raters who match the cultural background of the writers
whose work they judge arrive at different conclusions from raters who do
not share the same background? In the present paper, the results of a
research study involving both writers and readers from two different cultures
are examined in an attempt to partition out the sources of systematic
bias in the evaluation of writing.

Sources of bias: student variables

An overarching concern in the literature about bias in writing has
been the isolating of sociocultural factors in students' backgrounds which
contribute to differences in performance. A half-century ago, Caldwell
and Mowry (1933) demonstrated that bilingual Hispanic children were at a
disadvantage due to their use of language compared to their monolingual
English-speaking counterparts when evaluated by the essays they wrote; on
objective examinations the differences were not nearly as acute. Parallel
findings emerge from the recent large-scale study by White and Thomas (1981),
who combined files of data regarding entering students in the California
State University and Colleges system to yield graphic comparisons of total
scores for 5,246 whites, 585 blacks, 449 Mexican-Americans, and 6[?]
Asian-Americans on two English placement exams. The first was the CSUC's own
English Placement Test; the second was the Test of Standard Written English
from the College Entrance Examination Board. Although no statistical analyses
were presented, profiles of the four distributions suggest that a dialect
interference or second language interference hurt the overall performance of
the three minority samples on both tests. Lay (1978) has shown that native
Chinese-speaking students are at a disadvantage in writing English prose
because of the wide differences in structure and phonology of English and
Chinese. Rizzo and Villafane (1978) have shown that a similar explanation
applies to native Spanish-speaking students.

Many investigators of language have shown that structural aspects of
both oral and written language are significant in determining how children
process the world around them. Moreover, many of the rules which govern
functions of sending and receiving meaning using oral language are
significantly different from those for written expression (Olson, 1977). For
the non-native speaker of English the task of writing in English poses a
particular problem because

     ... the surface structure of writing is an inadequate
     representation of both the sound structure of the target
     language and its meaning. Learning the underlying structure
     of the target language is as much of a bootstrap operation
     as the initial process of learning a mother tongue (Smith,
     1975, p. 359).

One practical outcome of such a structural viewpoint is that students who
fail to acquire skills in the underlying structure of English might do
passably well with spoken English but probably will have great difficulty
with writing. Another factor not to be dismissed lightly is the attitudinal
or psychological readiness of the student to orient positively to the task
of acquiring skills in a new language (Cervantes, 1975; Lambert, Gardner,
Barik, & Tunstall, 1963). Without the necessary motivation and appropriate
learning context, students may be unable to let their knowledge of both the
mother tongue and the new language interact to their advantage.

Sources of bias: evaluation variables

Beyond the issues of students' involvement in languages lies an
important realm of educational and psychometric considerations having to
do with the quantity and quality of appraisal. The nature of the task,
how it is interpreted by both the student and the teacher, with what tools
the students' writing is judged and by whom are all issues of import.
In each of these lies the possibility of systematically different patterns
of response for students from culturally or linguistically different groups.
Each, then, may introduce its own bias into the evaluation of writing. The
purpose of the writing task usually given to students in the classroom is
to construct an essay following a particular prompt*. The teacher seeks a
sufficient amount of this writing to rate the quality of the student's work.
Exactly what elements are most important in that assessment of writing is
often dependent upon the persons creating the scoring system. Freedman
(1979) attempted to specify "definable parts" of student compositions
which influenced teacher judgments. She concluded that content, organization
and language mechanics were the most important factors, in that order.
The effect of "weak" content was so powerful that it overshadowed teacher
judgment in every other category. The interaction of content quality
judgments with the quality of the writing prompt is one point where bias
in assessment is possible.

*The prompt itself may contribute to systematic bias. Some students may
not know what the prompt represents because they do not completely
understand the vocabulary of the prompt in written form, or do not recognize
the pictorial content (the palm tree vs. evergreen problem). Differences of
an extreme nature are found in recognition of three dimensional objects
in photographs or drawings between children of developed and underdeveloped
countries. Subtler problems of prompt recognizability abound: one British
picture recognition test for the primary grades depicts electrical items
common in England but totally unknown in America.

The use of incompletely explicated scoring criteria introduces another
potential for bias in writing studies. In Rhodes-Hoover and Politzer's (1974)
study of teachers' attitudes toward Black rhetoric, teachers downgraded
compositions in the category of "language mechanics" because students failed
to use "super standard" English. For example, if a student wrote, "I got
there" as opposed to, "I reached my destination," the passage was considered
too colloquial. Teachers not only gave their own interpretation of "usage"
and "colloquial" but also imposed an undocumentable degree of severity
in their judgment that may or may not have been intended by the scale.

In a study comparing the syntactical characteristics of Mexican-American
and Anglo-American prose, Rodrigues (1978) asked educators whether they
could detect "slight" or "noticeable" differences in the prose syntax of
the two groups. At least 95% of these educators found some difference;
44% said they found "noticeable" differences. More Anglo-American educators
found "noticeable" differences than did Mexican-American raters. Bikson
(1977) conducted a study of differences in working lexicons of 72 lower
grade and 72 upper grade White, Chicano and Black elementary school students.
Results showed that ethnically diverse speakers made different kinds of
lexical choices, particularly in the early grades. The differences
between the Anglo lexicon and either the Black or Chicano lexicon were greater
than the differences between the two minority lexica. The study found
varying degrees of overlap between minority and Anglo word choice. The
minority students used a wider range of vocabulary than the Anglo group,
but this "broader" working vocabulary is not often valued by persons
evaluating the speech of these students.

Differences in classification of lexical terms between different
linguistic groups may have consequences for the selection of scoring
criteria to evaluate the writing of these groups. If we take concept
classification tasks to be analogous to organization tasks in the writing
process, then the different strategies used to associate words may reflect
different preferred methods of essay organization. If the scoring criteria
implicitly prefer one type of content organization strategy, such preference
could result in bias against those students who adopt alternative strategies.
Two studies in particular seem to suggest that words are sorted by different
ethnic groups into categories according to different classification
strategies. Rissel (1978) studied the vocabulary-semantic relationship for
monolingual English speakers, monolingual Spanish speakers, and Spanish/English
bilinguals living in New York and Puerto Rico to determine the classification
strategies of these groups. The study found that not only did the
classification strategies vary by linguistic group but that there appeared to
be a relationship between amount of language dominance and classification
strategy. Spanish dominant bilinguals employed comparative criteria, whereas
the more "balanced" bilinguals used comparative classification for Spanish
words and inclusive classification for English. Stahl (1977) conducted
a study comparing the "methods for arrangement" of content used by Israeli
students of European or Arabic extraction. He found that those of European
background tended to arrange the content in a hierarchical or inclusive
manner, whereas those of Arabic background tended to use more associative
or comparative techniques. An interesting aspect of his method was that
he gave higher points for hierarchical classification than for the use of
comparative methods. In the assessment of writing this would appear to be
a deliberate introduction of biased criteria into the scoring process.


Contrary results have been reported. In a study of syntactic patterns of
lower and middle class Chicanos, Garcia (1975/76) concluded that the
Chicanos used the same basic patterns found in American English, a conclusion
also tendered by Rodrigues (1978). At the same time, however, Garcia cited
research demonstrating differences in the morphological and phonological
systems used by Chicanos and Anglos.

Recent informal evidence demonstrates the potency of systematic
differences among raters of writing. Hartwell (1981) found that older, more
experienced writers selected very different passages as exemplary of
"professional writing" than did college freshmen. The differences appear
to be consistent along a number of dimensions, including content, coherence,
degree of complexity, and development. Differences in rating of a written
essay may also be related to the rater's own level of cognitive complexity
and integration (Sternglass, 1981). Rater background has been found to
influence how scoring criteria are interpreted and applied. Follman and
Anderson (1967) concluded that when raters shared similar backgrounds with
regard to education and opinions about what constitutes good writing, they
tended to agree on the ratings of essays more than raters who differed along
these dimensions.

Whether writing is assessed through normative-holistic means or through
differentiated judgments on dimensions of rhetorical quality, the scoring
"instrument" will always be a human judge. Consequently, no question about
fairness, validity, or accuracy in writing assessment can be fully addressed
without reference to possible errors in judgment. The intention of writing
assessment is to generate information useful for diagnosis and/or remediation.
When diagnostic utility is of interest several other issues are pertinent:
Diagnosis implies performance profiles which in turn require a
multidimensional view of the writing skill domain. Questions about skill
profiles are connected intimately to rater behavior in assigning ratings.
Scoring criteria are filtered through the expectancies of raters, and the halo
effect inflates inter-subscale correlations (Jaeger and Freijo, 1975).
The use of more and longer writing tasks only exacerbates this phenomenon.
Rating scales may interact. It is common for writing score profiles
to include some attention to essay "mechanics;" variations along this
dimension may influence ratings on other dimensions. Ratings assigned to
a writing sample on such dimensions as "organization" or "use of supporting
detail" may be assigned differentially depending on the quality of mechanics
within the essay. For mechanically substandard work, this process might
bring the assessment of other dimensions of writing quality into line with
the rater's impression of mechanics, while if the level of mechanics is not so
low as to call attention to itself, there may be minimal confounding.
However, across a given set of papers the net effect would be correlated
true and error components and concomitant inflation of inter-subscale
correlations. In a multitrait-multimethod factor analytic formulation the
expectation in general would be for negative correlations between mechanics
"trait" factors and ratings "method" factors. Quellmalz and Capell (1979) used
multitrait-multimethod confirmatory factor analyses to examine discriminant
validity of subscales generated by analytic scoring rubrics and the comparative
information yield of alternative response modes for writing assessment (i.e.,
essay, paragraph and selected response). Their results indicated relatively
high interconnections among subscale content factors, as well as a general
tendency for the shorter assessment modes to generate less pure indicators
of the subscale factors.

If non-native English speakers' English writing is easily distinguished
from that of native speakers on the dimension of mechanics, and if such
group differences contaminate other ratings assigned to non-native speakers,
a straightforward form of bias may be present. Ratings on other dimensions
will be systematically depressed, and the diagnostic utility of the
writing appraisal undermined. The present study was conducted to evaluate
such bias in the context of variations of ethnicity of both students and
raters, and of prompts. Additionally the nature of the task presented to
the students in order to get them to write an essay was varied systematically.

Method

Subjects

One hundred and thirty fifth and sixth graders from monolingual English
classrooms in a Southern California school district of moderate size were
involved in this study as a normal part of their classroom activities. These
students were not members of bilingual programs although some were involved in
remedial "pull-out" instruction. Of the 116 students who provided complete
essays, half were Hispanic-surnamed. Raters were four teachers hired
during school vacation, of whom two were Hispanic and two non-Hispanic.
These raters were from different school districts and had no other contact
of any kind with the students in this sample.



Instruments

The study used a standardized writing task with two topics, and a
modified scoring rubric which has been shown to have acceptable validity
and reliability (Quellmalz & Capell, 1979), explained shortly. The packet
containing the essay writing task consisted of a face sheet for the student's
name and date, followed by two prompts and two lined response pages, totalling
five pieces of paper per handout. The prompts involved two topics, one a main
street of a town and the other a robot. Order of presentation of the
prompts, and whether the prompt was written or pictorial, was controlled
for every participant. Written prompts involved five lines of typewritten
text, while picture prompts involved a lead sentence and a full-page
line drawing of the topic for children by a graduate student artist. In
both situations, the text concluded with the request that the student write
a paragraph about the topic presented. No other information was made
available to the student.

The raters reviewed these essays using the Center for the Study of
Evaluation's Factual Narrative scoring rubric, consisting of four primary
subscales: General Impression, Focus and Organization, Support, and Grammar
and Mechanics. Each of these was evaluated on a six-point scale, ranging from
clear mastery of the assignment to clear failure. For each of the six
values on each of the four scales, extensive guidelines for scoring were
provided. The General Impression rating of the essay is formed by considering
all aspects of the effectiveness of composition, including the remaining
three rating criteria. The Focus and Organization subscale handles such
issues as logical progression, transitions and topic development. The
Support subscale rates the use of specific supporting statements and
details. The Grammar and Mechanics subscale is used to evaluate the essay's
sentence construction, word usage, spelling and punctuation. As well as
an overall rating from this last subscale, the extent of errors in each
of the four areas of Mechanics noted above is rated separately. The
instructions of the CSE scoring rubric are explicit that raters using factual
scoring will likely find that some qualities of an essay cannot be considered
separate from others, but they are also quite direct in indicating how any
particular rating is to correspond to the annotation supplied in the
guidelines.

Procedure

Each child received one essay packet, containing two essay prompts
(one pictorial and the other written) and ruled pages for the child's essays.
The package of essay prompts was administered in a single half-hour
sitting by the children's classroom teachers, and essays were collected
and sent directly for rating without further intervention in the
classroom.

Each of the raters was given every essay packet in random order, but
without the face sheet and thus without identification of the name or
ethnic background of the student writers. Following five days of training
and pilot testing on use of the CSE rating scales, the four raters completed
scoring of the 116 essay packages which were complete and legible over a
seven day period. The resulting 32 ratings for each essay (four raters x
eight subscales) were then analyzed by a three factor analysis of variance
(student ethnicity x rater ethnicity x prompt modality) with repeats on
the second two factors (Winer, 1962) separately for each subscale. Also
collected from school district records were subtest totals on the
Comprehensive Test of Basic Skills (CTBS), administered as part of the regular
testing program by the school district, for all students involved in the
study. These scores allowed the investigation of possible relationships
between the measures of writing capability and four aspects of students'
intellectual capacity: vocabulary, passage comprehension, language mechanics,
and expression.

Results and Discussion

Only essays with complete ratings were considered in the analysis;
complete data were available for the four primary subscales for 100 essays,
and for the four detail subscales for 74 essays. Average rater agreement
across all subscales was high for the two Hispanic raters (9[illegible]%) and
moderately good for the non-Hispanic raters (85.46%). When all four
raters were compared, average agreement on the subscales was good (81.15%).
These values were considered as acceptable evidence that the training of
the essay raters had been satisfactory. To minimize potential confounding
from differences between the two topics, all scores were then standardized
within topic before further analysis.
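
The two computations in this paragraph can be illustrated briefly. The report does not state how agreement was defined or which standard deviation was used for standardization, so the sketch below rests on assumptions: agreement is taken as the percentage of essays on which two raters' scores differ by no more than a chosen tolerance, and standardization is a z-score using each topic's own mean and population standard deviation. The function names and example scores are hypothetical.

```python
from statistics import mean, pstdev


def percent_agreement(ratings_a, ratings_b, tolerance=0):
    """Percent of essays on which two raters differ by at most `tolerance`."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    hits = sum(1 for a, b in zip(ratings_a, ratings_b)
               if abs(a - b) <= tolerance)
    return 100.0 * hits / len(ratings_a)


def standardize_within_topic(scores, topics):
    """Convert each score to a z-score relative to its own topic."""
    by_topic = {}
    for score, topic in zip(scores, topics):
        by_topic.setdefault(topic, []).append(score)
    stats = {t: (mean(v), pstdev(v)) for t, v in by_topic.items()}
    result = []
    for score, topic in zip(scores, topics):
        m, sd = stats[topic]
        # A topic with no spread carries no information; map it to zero.
        result.append((score - m) / sd if sd > 0 else 0.0)
    return result


if __name__ == "__main__":
    # Hypothetical six-point-scale ratings, not data from the study.
    print(percent_agreement([1, 2, 3, 4], [1, 2, 4, 4]))
    print(standardize_within_topic([1, 3, 2, 6],
                                   ["street", "street", "robot", "robot"]))
```

After standardization each topic has mean zero and unit spread, so a difference in overall difficulty between the two topics cannot masquerade as an effect of prompt modality or ethnicity.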

On the General Impression subscale, the interaction between student
ethnicity (Hispanic or non-Hispanic) and rater ethnicity (Hispanic or
non-Hispanic) was significant (F(1,98) = 6.51, MSerror = 13.37, p<.01). While
the non-Hispanic student essays received about the same General Impression
scores from Hispanic raters as the Hispanic student essays, the non-Hispanic
raters significantly favored the non-Hispanic student essays. No other
main effect or interaction was significant for this subscale. The
interaction between student ethnicity and rater ethnicity was also found on
the Support subscale (F(1,98) = 4.02, MSerror = 31.48, p<.05), and on the
Mechanics subscale (F(1,98) = 7.18, MSerror = 36.42, p<.01). On the Support
subscale, the non-Hispanic student essays were again significantly favored
by the non-Hispanic raters. However, on the Mechanics subscale, the
non-Hispanic raters judged both student groups alike while the Hispanic raters
gave the essays of the non-Hispanic students significantly lower scores.
• For the Focus subscale-* a main effect\of rater ethnicity' (F- j 03- 

11.82, MSerror = 16,62, p<.001) and an interaction betwefen rater ethnicHy 
and prompt mode (picture prompt or written prompt) (F 1> / 98 = 6.41, , MSerror = 
19.01, p<.01) were found. In addition to the rater ethnicity by student 
ethnicity interactions, the' Support subscale yielded only a main effect 
of prompt modality (F x g8 = 10.43, MSerror'= 68.17, p<.001), and the Mechanics 

subscale yielded only a main effect of rater ethnicity (F 1 g8 = 13.45, 

MSerror = 36.42, p<.001). On the detail subscales of Mechanics, only one
effect emerged as significant: rater ethnicity as a factor in Usage ratings
(F and MSerror values illegible in source; p<.001). No other detail subscale
showed any significant main effect or interaction. Table 1 summarizes the
findings across the four primary and the Usage detail subscales by main effect
and interactions, and the results of post-hoc analyses.

Insert Table 1 about here

When performance scores on the CTBS were compared, neither the Hispanic
nor the non-Hispanic students emerged as significantly more capable on any
subscale than the others. The results of the correlational study between
student essay ratings and the four selected scale scores from the CTBS
can be summarized rapidly. Not a single significant correlation appeared
between any rating subscale and any CTBS scale for this sample. Thus there
appears to be no intrinsically overlapping information between writing
performance as judged on CSE's Factual Narrative rubric and a sample of
academic performance as judged on a multiple-choice examination.

The most important finding, repeated across three of the subscales, is
that the student ethnicity and rater ethnicity factors interact frequently
and substantively in the appraisal of students' written essays. Additionally,
rater ethnicity alone is also a significant factor in the ratings.
These results point to three conclusions. First, the evaluation of prose
writing seems to be systematically affected by factors which reflect
different cultural backgrounds. It is important to note that this effect
does not emerge when essays are grouped solely by student ethnicity;
rather, the students of one or the other background were often judged
differently by raters who share that background than by raters who do not.
Second, these factors include (but are not limited to) a match or mismatch
between raters' and writers' preferred language styles, and to some extent
the nature of the stimulus used to initiate the writing sample. Note,
however, that the three-factor interaction between student ethnicity, rater
ethnicity, and type of prompt was not observed for any of the subscales
used. Third, the phenomenon of systematic matching or mismatching
of preferences and styles occurs despite the fact that the evaluative scheme
used is one with a high degree of objectivity, which would be expected to
minimize such matching relative to more subjective rating schemes. The nature
of the judgment task is referenced point for point by the CSE scoring
rubric, and thus no scale-free or endpoint-only continuum judgments were
involved. Additionally, because raters were blind not only to the names
and ethnicities of the essay writers, but also to the study's hypotheses and
the proportional representation of ethnicities within the sample, whatever
matching occurred most likely stems from recognition of and preference
for certain subtle aspects of writing styles.

Some limitations of the present study deserve attention. There are
many possible secondary analyses of writing style, process and content which
have not been pursued here. No information about essay complexity or
other linguistic patterns is available from the present analysis. How
creative, stereotyped, or bizarre the particular essay is goes unremarked
in the CSE scoring system. The isolation of exact details within essay
content or specific preferences of individual raters was not within the
purview of this investigation. Moreover, there is a small possibility
that systematic differences in handwriting mastery contributed to the
recognizability of student ethnicity and thus to the ratings given, but
this was not examined directly. None of these considerations is seen as
critical to the interpretation of the results presented above, in particular
because the expected outcome of the analyses of variance in such instances
would necessarily be a main effect due to student ethnicity alone or a
three-way interaction between student ethnicity, rater ethnicity, and prompt
modality. None of these effects emerged in the present study, but rather
a pattern of findings which strongly suggests that some complex form of
bias is at work.



Bias in judgment is a phenomenon which obtains under a variety of
circumstances, some of which are intrinsic in the testing and evaluation
process. The present findings indicate that extrinsic factors must also
be considered. In the case of judgment of essays, where essay content has
virtually limitless possibilities and appraisal of necessity is at least
partially subjective, the opportunity for unintentional bias seems more
likely. For the teacher or essay test administrator seeking to limit bias
to the absolute minimum, the mandate is: those who are to perform the
rating of the essays must be matched for appropriate backgrounds with the
students who write the essays to be judged.



Rissel [...] domain in Spanish and English bilinguals. Bilingual Review,
1978, 2, 29-34.

Rizzo, B., & Villafane, S. Spanish language influences on written English.
Journal of Basic Writing, 1978, 1, 62-74.

Rodrigues, R. A statistical study of the English syntax of bilingual
Mexican-American and monolingual Anglo-American students. Bilingual
Review, 1978, 3, 205-211.

Smith, F. Spoken and written language. In Lenneberg, E.H. & Lenneberg, E.
(Eds.), Foundations of language development, a multidisciplinary approach.
New York, Academic Press, 1975.

Stahl, A. The structure of children's compositions: Developmental and
ethnic differences. Research in the Teaching of English, 1977, 11,
156-163.

Sternglass, M.S. Assessing reading, writing, and reasoning. College
English, 1981, 43, 269-275.

White, E.M. & Thomas, L.L. Racial minorities and writing skills assessment
in the California State University and Colleges. College English,
1981, 43, 276-283.

Winer, B.J. Statistical principles in experimental design. New York,
McGraw-Hill, 1962.





                                  Table 1

            Summary of statistically significant (p<.05) effects

Subscale:                   General      Focus and      Support   Mechanics   Usage
                            Impression   Organization                         detail (1)

N =                         100          100            100       100         74

Main Effects
  Student Ethnicity         -            -              -         -           -
  Rater Ethnicity           -            (2)            -         (2)         (2)
  Prompt                    -            -              (3)       -           -

Interactions
  Student x Rater           (4)          -              (4)       (5)         -
  Student x Prompt          -            -              -         -           -
  Rater x Prompt            -            (6)            -         -           -
  Student x Rater x Prompt  -            -              -         -           -

(1) Remaining detail subscales show no significant effects.
(2) Hispanic raters elevated relative to non-Hispanic raters.
(3) Picture prompt elevated relative to written prompt.
(4) Non-Hispanic raters + non-Hispanic student essays elevated relative to other
    combinations.
(5) Hispanic raters + non-Hispanic student essays depressed relative to other
    combinations.
(6) Non-Hispanic raters + Hispanic student essays elevated relative to other
    combinations.



Performance Patterns of Bilingual Children Tested in Both Languages

David L. McArthur
Center for the Study of Evaluation
University of California Los Angeles

Supported by a grant from the National Institute of Education
(NIE-G-80-0112)



Abstract

The testing of bilingual students poses particular problems for analysis
of performance, item bias and test adequacy. When children are selected
for their facility in two languages, and the same test is administered
in both languages, a special arena is provided for the study of these
problems. A widely-used test, the Comprehensive Test of Basic Skills,
is available in both English and Spanish. The vocabulary subtest was
administered to 1162 second-graders in bilingual education programs
throughout the Southwest, as part of a larger study; 58 of those students
received both versions of the test because they were deemed equally
proficient in both languages. Results show that patterns of performance
for these students differ markedly between the two versions, and suggest
that the test differs in important dimensions even though the Spanish
version is a rather faithful translation of the English original.



Severe problems confront the evaluation of bilingual program students
from the standpoint of both individual performance measurement and the
potential for bias in testing. Assessing the student in the majority
language runs one set of risks; assessing in the native tongue runs another.
The number of studies which have successfully assessed a single skill in
two languages for the same individuals is exceedingly small (Duran, 1980).
Resolution of these problems is not aided by the current controversy
surrounding both the definition and measurement of bilingualism itself
(De Avila, 1978). Moreover, thoroughly contradictory findings emerge
from studies of the acquisition of French by native English-speaking
children in Canada (Lambert & Tucker, 1972), of Swedish by native
Finnish-speaking children in Scandinavia (Skutnabb-Kangas & Toukomaa, 1976), and
of English by native Spanish-speaking children in the U.S. (Fischer &
Cabello, 1978). The integration of such differences may rest in part on
linguistic, developmental, and/or sociocultural interpretations (Troike, 1978);
a practical level of shared bilingualism or dominance of one language over
the other in the community may also play a strong role (Laosa, 1975).
Finnish-speaking children from the populous southern districts find, and
potentially model, both Finnish and Swedish in almost every shop window,
while the politics of separatism are explicit in Quebec and de facto in
many areas of the American Southwest, so children from these regions may
encounter the second language with mixed emotions. Assessing even a
relatively simple arena like vocabulary skills becomes multiply compounded
when dealing with students who must cope with two languages.



Measuring the skills of bilingual program students necessarily also
means assessing whether tests developed for the monolingual English student
are appropriate for making decisions about the bilingual or limited-English
proficient students characteristically found in such programs, and about the
minority groups who tend to be overrepresented there. Some educators
believe that many tests are intrinsically unfair to minorities because
the values they reflect are those of the majority only (Cervantes, 1975).
Others, however, hold that tests of culturally defined content and vocabulary
are not biased because achievement itself is language and culture
specific (Ebel, 1975). But the impetus for testing continues:

     The problem now becomes not whether to test bilingual students,
     but rather how to do it in a manner that accurately assesses
     their specific abilities and in a manner that does not create a
     bias either against them or in favor of them (Cooper, 1978,
     p. 2, italics original).

We turn attention specifically to assessment in Spanish-English bilingual
programs at the primary level, and encounter two factors which strongly
militate against simple effective solutions to the problems noted above.
The first is that exceedingly few instruments are available at present
which are both culturally appropriate and technically sound for this
purpose. "The problems are particularly acute with respect to English
language measures, but are often equally pervasive in instruments that are
simply translations from English language versions" (Burry, 1979, p. 8).
The second is that English-language instruction in reading, listening
comprehension and vocabulary may be intrinsically more difficult for native
Spanish-speaking children than for their native English-speaking counterparts
because of the increased rhythmic and phonological complexity of
English. Fundamental linguistic skills for understanding Spanish are
frequently inadequate for comprehending English. Even a relatively simple
phrase like "I c'n take it home fer ya" (/^yknteyklthowmf^ry^/ for the English
listener) is likely to be heard by the native Spanish-speaking child as
/'aintekromfia*/, resulting in the obliteration of six out of seven words
in the sentence (Matluck & Mace, 1973). The quantity of purely linguistic
differences between Spanish and English suggests that the Spanish-speaking
child is at no small disadvantage; especially in the primary grades,
appropriate language skills testing must not ignore such difficulties.

The Comprehensive Test of Basic Skills/Spanish (1974/1978) is in
large measure a direct translation of its English counterpart, which has
been widely used as a primary skills evaluation tool. The CTBS/S has
been presented as a major attempt to meet the needs of native
Spanish-speaking children (Finch, 1979). With such a test, the teacher can select
the language appropriate for a child with some assurance that the instrument
is valid, reliable and unbiased (Hoepfner & Christen, 1979). Thus, the
CTBS and CTBS/S should provide a good vehicle to examine individual
performance patterns in either language for students in bilingual programs.
However, recent evidence based on the performance of English- and
Spanish-speaking pupils suggests that the tests contain multiple sources of bias
(McArthur, 1981), so a particularly interesting situation for research
obtains when both versions of the CTBS are administered to the same children.
That is, if a group of children who possess similar levels of knowledge
in both English and Spanish are tested on both instruments, will individual
performances be the same across the two? Will the results of such
dual-language testing reflect patterns which can be interpreted as the direct
result of item bias? Will direct translation hold up as a viable strategy
for fair testing of primary pupils in Spanish as well as in English?

Methods

Subjects

As part of a larger study (CSE, 1979), almost 1200 children in bilingual
education programs in 26 school districts spread over five southwestern
states were administered a series of educational achievement tests by
their teachers. Programs were designed to provide instruction in reading
and mathematics at the upper primary level. Teacher reports from these
programs indicate that the time spent using Spanish as the language of
instruction was approximately equal to the time spent using English.
Ninety-three percent of the program teachers had earned at least a BA or BS;
94% were full-time employees of the school district, and 88% had prior
experience in bilingual education. Assignment of students to these special
programs relied primarily on teacher evaluations and language dominance
tests. Achievement tests were infrequently used to determine remediation
placement, and intelligence test scores were generally excluded altogether
from placement considerations. Thus the programs represented a major effort,
competently staffed, to provide special attention in a bilingual setting to
student educational needs. Most of the students were rated by their teachers
as having some skills in both English and Spanish. Overall only one child
in ten from these classes was considered monolingual Spanish, while only one
in nine was rated as monolingual English.



Instruments

While a large number of instruments were used in the investigation
of programs, only the CTBS is of concern in the present study. It was
selected because test content between the two language versions is virtually
identical. The CTBS-Spanish was the first test by a major publisher to be
subjected to a four-step editorial procedure designed to reduce bias;
included were studies of content validity, application of editorial
guidelines in item construction, reviews for bias, and separate ethnic
group pilot studies. The developers of the Spanish-language version tried
to keep the test content and measurement features intact, thus building a
test which was similar in rationale, administration and interpretation to
its parent version in English. What differences exist are the result
primarily of problems of literal translation.

The children in the study were given a large number of standardized
tests of achievement during the course of the regular school year by their
teachers. With regard to the CTBS, the important instruction made to
teachers was that they decide in advance on an individual basis whether
each child would receive the English-language or Spanish-language version
of the test. This decision was left totally to the discretion and best
judgment of the classroom teachers. A total of 1162 completed test forms
were returned, 814 in English and 348 in Spanish. Fifty-eight students
in the sample were found to have been tested in both languages; that is,
one student in every nineteen was given both forms of the test because the
teachers felt unable to distinguish in advance which language these students
should be tested in. No evidence is available to suggest that any selection
bias or other external circumstance might have contributed to obtaining
this sample. Order of administration was apparently random. For purposes
of this report, only the Vocabulary subscale of test level C, consisting
of 33 items selected in response to the teacher's verbal directions, is
considered.



Methods of analysis

Two techniques for analysis of response patterns were utilized in
this study. The first relies on the work of Sato (1980) and colleagues in
Japan; they have generated a systematic method of appraisal of test
performance based on the S-P (Student-Problem) Chart, a matrix of right and
wrong answers, coded 1 or 0, for each respondent for each item. The N x n
matrix has the additional characteristics that students have been sorted
by descending total score and items have been sorted by increasing difficulty.
Thus the top row of the S-P Chart is a representation of the pattern of
correct and incorrect responses to this sample of items by the most capable
student in the group, the bottom row by the least capable. The left-hand
column shows the pattern of responses to the easiest item in the set of
items, and the right-hand column shows the most difficult. From this matrix
are generated two statistics, one related to the pattern for the group
as a whole, the other related to individual performance vis-a-vis both the
group and the configuration of items, for each individual. The first is
an "index of discrepancy," D*, which ranges from 0.00 for a matrix of
perfect symmetry between student capabilities and item difficulties, to
1.00 for a matrix representing exclusively random responding. [1] The second
is a Caution Index, C, which ranges from 0.00 for an individual whose
response pattern is perfectly fitted to that reflected in the order of item
difficulties as determined by the group, to 1.00 for an individual whose
pattern of responses is totally antithetical to the order of item difficulties
and thus is quite unlike the representative average respondent in the group. [2]
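
The construction of the S-P chart described above can be sketched in a few lines of Python. The exact formulas for D* and the caution index are given in the report's notes and are not reproduced in this section, so the sketch below makes an assumption: it computes a bounded variant sometimes called the modified caution index, which is 0 for a perfectly ordered response pattern and 1 for a fully reversed one, matching the range stated in the text. D* is not computed. The function names and the use of NumPy are illustrative choices, not part of the study.

```python
import numpy as np


def sp_chart(responses):
    """Sort a 0/1 persons-by-items matrix into S-P chart order.

    Rows are ordered by descending total score, columns by descending
    number correct (easiest item first). Returns the sorted matrix and
    the row and column orderings that were applied.
    """
    u = np.asarray(responses, dtype=int)
    row_order = np.argsort(-u.sum(axis=1), kind="stable")
    col_order = np.argsort(-u.sum(axis=0), kind="stable")
    return u[row_order][:, col_order], row_order, col_order


def modified_caution_index(sorted_u):
    """Bounded caution index per respondent (assumed variant, see lead-in).

    0.0 = only the easiest items answered correctly (perfect ordering);
    1.0 = only the hardest items answered correctly (reversed ordering).
    """
    n_items = sorted_u.shape[1]
    item_totals = sorted_u.sum(axis=0)  # already in descending order
    indices = []
    for row in sorted_u:
        k = int(row.sum())  # this respondent's total score
        # Gap between the k easiest and the k hardest item totals.
        denom = item_totals[:k].sum() - item_totals[n_items - k:].sum()
        if k == 0 or k == n_items or denom == 0:
            indices.append(0.0)  # pattern cannot be aberrant
            continue
        missed_easy = ((1 - row[:k]) * item_totals[:k]).sum()
        hit_hard = (row[k:] * item_totals[k:]).sum()
        indices.append(float(missed_easy - hit_hard) / float(denom))
    return np.array(indices)
```

A respondent who answers only the hardest item correctly receives an index of 1.0 under this variant, while respondents whose correct answers fall entirely on the easiest items receive 0.0.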
The second analytic tool used in this study is a statistic from Goodman
and Kruskal called lambda, which has been applied elsewhere to the detection
of differences in response patterns in testing (Veale & Foreman, 1976). Here
the focus is on differences between groups in the attractiveness of incorrect
responses within the multiple-choice format of one correct and three
incorrect responses per item. Lambda is an index of the pattern of choice for
the incorrect responses. If the value of lambda is 0.00, the two groups
use about the same pattern of selection of the incorrect responses. As
the value increases, one group is using a different strategy for selection
of incorrect responses than the other. The computation of lambda is
independent of the actual proportions within each group who select the correct
response to the item. In this paper, values of lambda above .10 are
considered noteworthy. [3]
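
To make the idea concrete, here is a minimal Python sketch of the standard asymmetric Goodman-Kruskal lambda for a small table of counts, with groups as rows and the three incorrect options as columns. Whether this is precisely the form used in the study is an assumption; the function name and the counts in the usage note are hypothetical.

```python
def goodman_kruskal_lambda(table):
    """Asymmetric lambda: how much knowing the group improves prediction
    of which incorrect option an examinee chose.

    table[g][d] is the number of examinees in group g who chose
    incorrect option d. Correct responses are left out entirely, so the
    result does not depend on the proportion answering correctly.
    """
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(column_totals)
    best_overall = max(column_totals)  # modal option, ignoring group
    if n == best_overall:
        return 0.0  # everyone chose the same option; nothing to predict
    best_within = sum(max(row) for row in table)  # modal option per group
    return (best_within - best_overall) / (n - best_overall)
```

With hypothetical counts `[[10, 5, 5], [5, 10, 5]]` the two groups favor different wrong answers and lambda is 0.2, above the .10 threshold used in the text; with `[[10, 5, 5], [20, 10, 10]]` both groups share the same pattern and lambda is 0.0.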

Details of the computation and use of these approaches in the context
of testing and item bias detection research have been set out elsewhere
(McArthur, 1981). The usual test-retest and reliability statistics are not
appropriate here, because of the attention to deciphering specific
performance patterns rather than whole-group performance.
Hypotheses

Because of the process of respondent selection, specific hypotheses about
their performance on the English-language and Spanish-language versions of
the Vocabulary subtest were, first, that the achieved scores between tests
would be perfectly correlated. Additionally, the S-P charts for the two
versions would be similar, as shown by equal indices of discrepancy, D*.
At the level of the individual respondent, it was hypothesized that the
achieved total score in English would equal the achieved total in Spanish,
and that the caution index generated for each individual in the
English-language S-P chart would be equal to the caution index obtained by the
same individual from the Spanish-language S-P chart.

Results

Total scores on the English-language Vocabulary subtest averaged
75.34% correct with a range of 6 - 33. On the Spanish-language version,
the average was 37.56% correct with a range of 4 - 25. The total scores
are significantly (p<.05) correlated, r = .48. Median improvement from
Spanish to English is 13 answers correct. Only three of the 58 participants
did not show improvement in their total scores from Spanish to English.
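
The three summary figures in this paragraph (the correlation between versions, the median gain, and the count of students who did not improve) can be obtained from paired total scores as in the following Python sketch. The function names are hypothetical and no data from the study are reproduced; "did not improve" is read here as a gain of zero or less, which is an assumption.

```python
from math import sqrt
from statistics import mean, median


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def paired_summary(english_totals, spanish_totals):
    """Summarize paired totals for students tested in both languages."""
    gains = [e - s for e, s in zip(english_totals, spanish_totals)]
    return {
        "r": pearson_r(english_totals, spanish_totals),
        "median_gain": median(gains),
        # Assumption: "no improvement" means the gain is zero or negative.
        "not_improved": sum(g <= 0 for g in gains),
    }
```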

Two of the 33 items yielded higher percentages of correct response in
the Spanish-language version than in the English. For the remainder of the
items, students were able to select the correct response less frequently
in the Spanish-language version, often by substantial margins. The ratio
of Spanish correct to English correct for each item is shown in the first
column of Table 1. The consistency with which students picked the correct
answer in both languages ranged from moderately high (65% of the respondents
chose the correct answer to item 8 in both languages) to very low (only
7% chose the correct response to item 31 in both languages). The consistency
of selection of incorrect responses was generally extremely low,
reaching 14% for items 24 and 31. The proportions of joint correct and
joint incorrect responses are shown in columns 2 and 3 of Table 1.

Insert Table 1 about here

Those incorrect answers to items which garnered at least 10% more
responses than the next most frequently chosen incorrect response were
termed "popular distractors." Three popular distractor items were found
in the English-language version, while twelve were found in the Spanish.
The average percentage of respondents who chose the correct answer to
an item in English but were swayed to choose the popular distractor
(incorrect) response to that same item in Spanish was 35%. The reverse,
choosing a popular distractor response in English although selecting the
correct response to that same item in Spanish, was 30%. Whether a specific
item contained a popular distractor, and if so the percentage of respondents
correct on the same item in the other language but who chose that popular
distractor, is indicated in the next four columns of Table 1.
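
The rule for flagging a popular distractor can be sketched as follows. The text does not say whether "10% more responses" means ten percentage points of all respondents or a relative margin; the sketch assumes percentage points, and the function name and option labels are hypothetical.

```python
from collections import Counter


def popular_distractor(choices, correct, margin=0.10):
    """Return the incorrect option that leads the next most chosen
    incorrect option by at least `margin` of all respondents, or None.

    `choices` holds one selected option per respondent for a single item.
    Interpreting the margin as a share of all respondents is an assumption.
    """
    wrong = Counter(c for c in choices if c != correct)
    ranked = wrong.most_common()
    if not ranked:
        return None  # nobody answered incorrectly
    top_option, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0
    if (top_count - second_count) / len(choices) >= margin:
        return top_option
    return None
```

For ten respondents choosing `"AAAABBBBCD"` with `"A"` correct, option `"B"` leads the other wrong answers by three respondents (30 percentage points) and is flagged; when the wrong answers are spread evenly, nothing is flagged.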

The data to this point quite clearly indicate that the Spanish-language
version of the CTBS presented a far more difficult task for these respondents
than did the English-language version. Only infrequently did any
vocabulary item have an equal percentage of incorrect selections in both
versions. Examination of the S-P charts is necessary to show whether the
difference in performance patterns is systematic.

The Spanish-language version generated a D* of .53, a relatively high
level of randomness of responses, while the English-language version yielded
a D* of .24, reflecting a much more orderly fit of subject capabilities to
item difficulties. No exact test of significance exists for the size of,
or differences between, D* values, but in this instance they represent
configurations of the S-P charts which are distinctly different visually.
The difference is supported by reference to the caution indices, which for
individual respondents to the English-language version averaged .17, but to
the Spanish-language version .25. That is, on average the respondents
were more consistent in selecting correct answers to easy items and incorrect
answers to difficult items in the English-language version. In
fact, the number of respondents with caution indices of 0.00 is much
higher in English. Of particular interest is that the correlation between
the two indices computed across the 58 participants is nonsignificant.
Changes in caution indices from one language version to the other are
uncorrelated.

The computation of lambda, which details differences in selection
patterns for wrong answers, showed that twelve out of 33 items had large
discrepancies in the obtained configuration. That is, for a large number
of items, the respondents shifted their choice from one incorrect answer
to another across language versions, rather than picking the same incorrect
responses on both occasions. The last column of Table 1 indicates those
items with such shifts in incorrect answers.

Discussion

The findings of this study in general comport with earlier research
on the CTBS in English and Spanish using independent groups of bilingual
program respondents (McArthur, 1981). The distributions of total subscale
scores, the higher D* indices for the Spanish-language version, the number
of popular distractors and of lambda values exceeding .10 are all similar.
That the two versions of the test do not produce equal outcomes even when
the actual respondents are identical seems clear from the present data.
If there was to have been equivalence of total subscale scores, of group
or individual patterns of correct scores, or of selection of wrong answers
between the English- and Spanish-language versions, the number of
discrepancies emerging from the statistical computations would have been far
smaller. In their present configuration, these data suggest that children
do not show the same performance patterns in response to the two versions
of the test. Review of data contained in Table 1 suggests that many
of the items may be suspected of somehow biasing the choice of correct
response, and that such potentially biasing items are more prevalent in
the Spanish-language version.

The relatively small number of individuals represented in this study
makes these results necessarily tentative: they are presented neither as
a representation of majority vs. minority responses to a specific test,
nor as an indication in any way of a measure of true ability among
bilingual program students. Rather, the unusual trial of a purportedly decent
test in two languages, a purportedly equal-ability student sample, and a
classroom experience for that sample equally divided into the use of
English and Spanish, demands thoughtful attention to the appraisal of
testing. In the present investigation, one weakness is the absence of an
independent and unambiguous assessment of bilingual capability, and the
ensuing reliance on the accuracy of teacher selection of students equally
competent in two languages. DeAvila and Duncan (1978) have pointed out
numerous shortcomings in teacher ratings of language competence. However,
for this study, students were not drawn for their equally high abilities
or for the purposes of assembling a homogeneous sample, but only for their
language abilities to be equally high or low in both languages. Nothing
is known about the relative levels of exposure to English or Spanish
outside the school, nor about the relative strengths and weaknesses of
the texts in both languages used in the program. However, the teachers'
close personal supervision of students and the even division between
English and Spanish as the language of instruction in these programs suggest
that the children's levels of readiness for vocabulary would be roughly
similar. Another weakness is the relatively small number of items included
in this investigation. However, the CTBS appears to represent the state
of the art in English/Spanish testing of vocabulary skills at this level,
and no other instrument is known to be a closer approximation to neutrality.
The present results support the contention that the method of direct
translation from English to Spanish for bilingual vocabulary testing may
not be fully adequate for the needs of the bilingual program student.



13 

References 

Burry, 0. Evaluation in, bilingual education. Evaluation Comment , 

Center for the Study of Evaluation, Los Angeles, 1969, 6_, 1-14. 

Cervantes, R.A. Self-concept, locus of control,- and achievement in 

Mexican-American pupils . Unpublished doctoral dissertation, 

* 

Union Graduate School -West, San Francisco, 1975. 
Cooper,, E. Test selection in bilingual education evaluation . P.aper 

presented at the Annual Meeting of the American Educational Research 

Association, Toronto, 1978. „ ^ , 

Comprehensive Test of Basic Skills (CTBS) Examiner's Manual and Espanol 

Examiner's Manual / Monterey, CTB/McG raw-Hi 11 , 1974/1978. 
Center for the Study of Evaluation: . Final Report: Basic Skills 

Learning Centers evaluation . Los Angeles, UCLA; 1979. ' , 
beAvila, E. "&. Duncan, S.E. Definition and measurement, the east and ^ 

west of bilingual ism , larkspur, California, DeAvila, Duncan & Associates > 

mimeo, 1978,' ■ / • 

Duran, R.P. Bilinguals' skill in solving logical reasoning problems in 
     two languages. Princeton: Educational Testing Service, 1980. 
     (ERIC Document Reproduction Service No. ED 198 724). 
Ebel, R.L. Constructing unbiased achievement tests. Paper presented 
     at the National Institute of Education Conference on test bias, 
     Baltimore, 1975. 
Finch, F.L. At last: Spanish version of CTBS. Paper presented to the 
     California Association of Bilingual Educators, Fresno, 1979. 






Fischer, K.B. & Cabello, B. Predicting student success following transition 
     from bilingual programs. Paper presented at the Annual Meeting of the 
     American Educational Research Association, Toronto, 1978. 
Hoepfner, R. & Christen, F. Measures of academic growth. Santa Monica: 
     System Development Corporation, mimeo, 1979. 
Lambert, W.W. & Tucker, G.R. Bilingual education of children: the 
     St. Lambert experiment. Rowley, Massachusetts: Newbury House, 1972. 
Laosa, L.M. Bilingualism in three United States Hispanic groups: contextual 
     use of language by children and adults in their families. Journal 
     of Educational Psychology, 1975, 67, 617-627. 
Matluck, J.H. & Mace, B.J. Language characteristics of Mexican-American 
     children: implications for assessment. Journal of School Psychology, 
     1973, 11, 365-386. 
McArthur, D.L. Detection of item bias using analyses of response patterns. 
     Los Angeles, UCLA, mimeo, 1981. 
Sato, T. The S-P chart and the caution index. NEC (Nippon Electric 
     Company, Japan), Educational Informatics Bulletin, 1980. 
Skutnabb-Kangas, T. & Toukomaa, P. Teaching migrant children's mother 
     tongues and learning the language of the host country in the context 
     of the socio-cultural situation of the migrant family. Helsinki, 
     Finnish National Commission for UNESCO, mimeo, 1976. 
Troike, R.C. Research evidence for the effectiveness of bilingual 
     education. Washington, D.C., National Clearinghouse for Bilingual 
     Education, mimeo, 1978. 
Veale, J.R. & Forman, D.I. Cultural variation in criterion-referenced 
     tests: a global item analysis. Paper presented at the Annual Meeting 
     of the American Educational Research Association, San Francisco, 1976. 



Table 1 

Summary of Findings for the CTBS and CTBS/S 

Columns: item number; ratio of Spanish correct to English correct; 
joint correct; joint wrong; popular distractor (English, Spanish); 
% who move from correct in one lang. to popular distractor in the 
other (S to E). Cells marked -- are not legible in the source. 

          ratio of 
          Spanish 
          correct to 
item      English       joint       joint      popular 
number    correct       correct     wrong      distractor 

   1        .45           42           0 
   2        .32           32           4 
   3        .64           47           0 
   4        .63           60           2 
   5        .60           42           2 
   6        .74           56          -- 
   7        .62           47           9 
   8        .73           65           0 
   9        .64           46           0 
  10        .37           --           0 
  11        .29           25          -- 
  12        .37           30           0 
  13        .25           19           4 
  14        .53           35           4 
  15        .46           30           9 
  16        .11            9          -- 
  17        .13            9           4 
  18        .67           54           4 
  19        .23           18           9 
  20        .63           49           2 
  21        .22           16           5 
  22       1.06           37           9 
  23        .77           44           5 
  24        .55           19          14 
  25        .53           23           5 
  26        .23           12           9 
  27        .51           33           4 
  28        .37           12           4 
  29        .36            5           9        yes 
  30        .51           30           7        yes 
  31        .40            7          14        yes 
  32        .70           42           9 
  33       1.41           11          -- 






Entries for the remaining columns of Table 1, in source order: 

yes 

E to S: 32; yes, yes; yes; yes, yes; yes; yes; yes, yes, yes, yes; 
56, 5; 30; 17, 12; 31; 54, 40; 43; 60, 28, 48, 19 

lambda greater than .10: yes; yes, yes; yes; yes, yes; yes; yes; 
yes; yes; yes, yes 



Footnotes 

$$D^{*} = \frac{A(N, n, p)}{A_{B}(N, n, p)}$$ 

where the numerator is a discrepancy between cumulative probability 
ogives obtained from the S-P chart, and the denominator is an analogous 
discrepancy as modeled by cumulative binomial distributions, both with 
the same number of cases, number of items, and average passing 
rate (Sato, 1980). 



$$\frac{\operatorname{cov}(u_{ij},\, y_{j})}{\operatorname{cov}(u^{I}_{ij},\, y_{j})}$$ 

where the numerator is the covariance over problems of the i-th 
student's score on the j-th problem with the number of students who 
correctly answer that j-th problem, and the denominator is the 
covariance over problems of the i-th hypothetical ideal student's score 
on the j-th problem with the number of students who correctly answer 
that j-th problem (Sato, 1980). 
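
The covariance ratio described in this footnote can be illustrated with a short sketch. It is not taken from the report: the function names are hypothetical, and the construction of the "hypothetical ideal student" (same total score as the actual student, with the correct answers placed on the problems that the most students answer correctly) is an assumption, since the footnote does not spell it out. 

```python
def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def covariance_ratio(scores, student):
    """Ratio of cov(actual pattern, y) to cov(ideal pattern, y).

    scores[i][j] is 1 if student i answered problem j correctly, else 0.
    y[j] is the number of students who answered problem j correctly.
    """
    n_items = len(scores[0])
    y = [sum(row[j] for row in scores) for j in range(n_items)]
    actual = scores[student]
    total = sum(actual)

    # Assumption: the ideal student has the same total score, earned on
    # the `total` problems answered correctly by the most students.
    easiest_first = sorted(range(n_items), key=lambda j: -y[j])
    ideal = [0] * n_items
    for j in easiest_first[:total]:
        ideal[j] = 1

    denominator = covariance(ideal, y)
    if denominator == 0:
        raise ValueError("ratio undefined: ideal pattern does not covary with y")
    return covariance(actual, y) / denominator


# Hypothetical 4-student, 4-problem scoring matrix (not data from the study).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
ratio_typical = covariance_ratio(scores, 0)   # pattern matches the ideal
ratio_unusual = covariance_ratio(scores, 2)   # misses an easy problem
```

In this sketch a ratio of 1 means the student's response pattern coincides with the ideal pattern, and smaller values indicate a more unusual pattern. 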



$$\lambda = \frac{\sum \max f_{jk} - \max f_{\cdot k}}{N - \max f_{\cdot k}}$$ 

where $\max f_{jk}$ is the larger frequency of the two groups for any 
single wrong choice, $\max f_{\cdot k}$ is the larger marginal frequency 
of the two groups across all wrong choices, and $N$ is the total number 
of observations. 
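
As a worked illustration of the lambda statistic defined in this footnote, the following minimal sketch computes it from a table of wrong-choice frequencies for two groups. The function name and the example frequencies are hypothetical and are not data from the study; the .10 threshold is the one named in the Table 1 column heading. 

```python
def wrong_choice_lambda(table):
    """Lambda for a groups x wrong-choices frequency table.

    table[k][j] is the number of examinees in group k who selected
    wrong choice j of the item.
    """
    n_choices = len(table[0])
    n_total = sum(sum(row) for row in table)

    # Sum, over wrong choices, of the larger of the group frequencies.
    sum_of_maxima = sum(max(row[j] for row in table) for j in range(n_choices))

    # Larger marginal frequency of the groups across all wrong choices.
    largest_margin = max(sum(row) for row in table)

    if n_total == largest_margin:
        raise ValueError("lambda undefined: all observations fall in one group")
    return (sum_of_maxima - largest_margin) / (n_total - largest_margin)


# Hypothetical frequencies for three wrong choices of one item.
english = [10, 5, 5]
spanish = [2, 12, 6]
lam = wrong_choice_lambda([english, spanish])
exceeds_threshold = lam > 0.10
```

With these frequencies the numerator is 28 - 20 and the denominator is 40 - 20, so lambda is .40 and the item would be marked in the "lambda greater than .10" column. 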






